The latest post in the series on optimizing embedded software to improve (i.e. reduce) worst-case execution times focuses on single path code.
Software optimization techniques which improve worst-case execution times #9: Single Path Code
Improved worst-case execution times can often be achieved by modifying functions or blocks of code so that they have a single execution path. This is particularly relevant when the code runs on advanced, high-performance processors, where instruction pipelines greatly improve execution speed for sections of code that can be executed without branching.
Single path code has the added advantage that obtaining adequate test coverage is far simpler than it is for data dependent code with multiple paths.
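A common way to obtain a single path is to replace a conditional branch with pure data flow. As a minimal sketch (the helper `max_u32` is illustrative, not from the article), a branch-free maximum can be built by turning the comparison result into an all-ones or all-zeros mask:

```c
#include <stdint.h>

/* Branch-free maximum of two 32-bit values (illustrative helper, not
   from the article).  The comparison result is converted into an
   all-ones or all-zeros mask, so selecting a or b is pure data flow:
   the same instructions execute regardless of the input values. */
static inline uint32_t max_u32(uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b); /* 0xFFFFFFFF if a < b, else 0 */
    return (a & ~mask) | (b & mask);
}
```

Because every input takes the same instruction sequence, the worst-case and average-case execution times coincide, and test coverage of the single path covers all data values.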
The worst-case execution time of a function or block of code can easily be compromised by code constructs aimed at improving average-case behaviour. This is typically the case when an early exit is provided for non-worst-case data: the test for the early exit speeds up the average case, but it executes on every path and so lengthens the worst-case execution time.
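As a concrete contrast, a data-dependent loop of the following shape (similar in spirit to the while-loop implementation the series started from; the name `csb_loop` is illustrative) exits early for small inputs:

```c
#include <stdint.h>

/* Data-dependent "count set bits" loop (illustrative; in the spirit of
   the series' original while-loop version).  The loop exits as soon as
   no set bits remain, so small inputs return quickly -- good for the
   average case -- but the worst case, x = 0xFFFFFFFF, still iterates
   32 times, and the exit test itself is executed on every iteration. */
static uint32_t csb_loop(uint32_t x)
{
    uint32_t count = 0;
    while (x != 0) {       /* early exit: iteration count depends on the data */
        count += x & 1u;
        x >>= 1;
    }
    return count;
}
```

The number of paths through this loop depends on the input data, which is exactly what makes both its worst-case timing and its test coverage harder to pin down than for single path code.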
The following example represents a further improvement to the worst-case execution time of the “count set bits” function. Here the loop has been removed and a series of table look-ups is used to determine the number of set bits in each 4-bit section of the 32-bit value. The function now has a single execution path.
Example: Single path
const Uint8 bits_table[] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

Uint32 csb6(Uint32 x)
{
    return bits_table[x & 0xF] +
           bits_table[(x >> 4) & 0xF] +
           bits_table[(x >> 8) & 0xF] +
           bits_table[(x >> 12) & 0xF] +
           bits_table[(x >> 16) & 0xF] +
           bits_table[(x >> 20) & 0xF] +
           bits_table[(x >> 24) & 0xF] +
           bits_table[x >> 28];
}
Using a larger look-up table with, say, 256 entries would mean that only four array look-ups were needed, one for each byte of the 32-bit word. This would further improve the worst-case execution time at the expense of using more memory: a typical memory versus speed trade-off. Interestingly, a larger look-up table could be well worthwhile if the function were inlined in a number of places. Here the reduction in overall code size could exceed the increase in the size of the look-up table.
It is worth noting that the use of very large look-up tables can on some processors result in more cache misses and hence an increased worst-case execution time.
The table below shows the worst-case execution times for different implementations of the “count set bits” function.
| Function | MPC555 WCET (clock ticks) | HC12 WCET (clock ticks) |
|----------|---------------------------|--------------------------|
| csb3 | 272 | 1140 |
| csb4 | 151 | 1180 |
| csb5 | 86 | 1028 |
| csb6 | 52 | 2080 |
| csb7 | 32 | 80 |
| Reduction in WCET (csb3 to csb7) | 88.2% | 93% |
The final implementation, csb7, is similar to csb6 but uses the larger 256-entry look-up table indexed by byte. The reduction in worst-case execution time between the initial while-loop implementation and this final look-up table implementation is better than a factor of 8 for the MPC555 (272 to 32 clock ticks) and better than a factor of 14 for the HC12 (1140 to 80 clock ticks).
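The article does not list csb7's source; the following is a hypothetical sketch of what a csb7-style function could look like, assuming only the stated design (a 256-entry, byte-indexed table and four look-ups per 32-bit word). The table is filled at start-up here for brevity; on a real target it would more likely be a `const` array placed in ROM.

```c
#include <stdint.h>

/* Hypothetical csb7-style sketch (not the article's actual code):
   byte_bits[i] holds the number of set bits in the byte value i, so a
   32-bit word needs only four look-ups and the path is branch-free. */
static uint8_t byte_bits[256];

/* Build the table once: i has the same set bits as i >> 1, plus its
   own low-order bit. */
static void init_byte_bits(void)
{
    for (int i = 1; i < 256; i++)
        byte_bits[i] = (uint8_t)((i & 1) + byte_bits[i >> 1]);
}

static uint32_t csb7_sketch(uint32_t x)
{
    return byte_bits[x & 0xFFu]
         + byte_bits[(x >> 8) & 0xFFu]
         + byte_bits[(x >> 16) & 0xFFu]
         + byte_bits[x >> 24];
}
```

As with csb6, every input takes the same sequence of instructions, so a single test vector exercises the complete path; the remaining caveat is the cache behaviour of the larger table noted above.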