Our series of blog posts on optimizing embedded software with the aim of improving (i.e. reducing) worst-case execution times continues with loop unrolling.
Software optimization techniques which improve worst-case execution times #7: Loop Unrolling
Improvements in worst-case execution time can usually be gained by unrolling loops. Often an appropriate compiler option can be used to unroll loops, for example the gcc options –funroll-loops unrolls loops when the number of iterations is constant whilst –funroll-all-loops unrolls all loops.
Sometimes it can be worthwhile unrolling loops by hand as loop unrolling often leads to further opportunities for optimisation.
Example: Loop
Uint32 csb3(Uint32 x) { Uint32 set_bits = 0; while(x) { if(x & 0x1) set_bits++; x >>= 1; } return set_bits; }
In the example above, the function csb3() counts the number of set bits in a 32-bit unsigned integer: x. The termination condition checks if there are any set bits left in x (non-zero value). If so, the least significant bit is checked and x shifted right, so the next most significant bit can be checked, and so on. The worst-case execution time of this implementation occurs when the value passed is 0xFFFFFFFFU. In this case, the loop iterates 32 times with 33 evaluations of the loop condition.
Partially unrolling the loop results in an execution time that may be longer in the average case, as there are fewer opportunities for an early exit, but shorter in the worst-case. In the worst-case, the revised loop iterates just 4 times with 5 evaluations of the loop termination condition.
Unrolling the loop leads directly to further optimisation: rather than include the operation x >>= 1; four times, and compare x against 0x1, the code has been modified to use one shift by 4 and constants that equate to set bits in bit positions 0 to 3. This reduces the total number of shift operations required from 32 to 8.
Example: Partially unrolled loop
Uint32 csb4(Uint32 x) { Uint32 set_bits = 0; while(x) { if(x & 0x1) set_bits++; if(x & 0x2) set_bits++; if(x & 0x4) set_bits++; if(x & 0x8) set_bits++; x >>= 4; } return set_bits; }
Together with the loop unrolling, the reduction in the number of shift operations leads to a significant reduction in the worst-case execution time for the MPC555. However, the HC12 only has operations to shift by one bit at a time and thus the “optimised” code actually ends up resulting in a longer execution time.
Function | WCET (clock ticks) | |
MPC555 | HC12 | |
csb3 | 272 | 1140 |
csb4 | 151 | 1180 |
Reduction in WCET | 44.5% | -3.55% |