EM•Mark execution time – TI CC2340R5
While we normally must choose between optimize-for-size and optimize-for-speed options when compiling application programs, our second set of EM•Mark results will confirm that small(er) programs can execute fast(er) – a key premise of the EM language and runtime.
Program size matters
Our last blog post reported a 60% reduction in total program size between legacy CoreMark [15233 bytes of C code] versus our EM•Mark re-design [5792 bytes of EM code] – in large part due to the WPO + ACO optimizations employed by the latter.
But exactly why might a smaller program execute faster in this particular context?
a smaller program might simply execute fewer instructions
a smaller program might utilize on-chip cache more efficiently
a smaller program might fit entirely into (fast) on-chip SRAM
If we treat the TI CC2340R5 [48 MHz Cortex-M0+, 512K Flash, 36K SRAM] as a "typical" MCU targeted by EM, we find that these sorts of mainstream devices invariably feature an on-chip cache to mitigate the impact of a fast(er) CPU fetching instructions from slow(er) Flash memory.(1)
- otherwise requiring several CPU wait-states, which would proportionally degrade overall program execution time
Implemented as zero wait-state SRAM, a large cache capable of holding the entire program image would (in theory) achieve maximum execution throughput.
In practice, however, an MCU cache can typically hold only 4K to 16K of program code – in part to minimize the impact of the cache block on silicon size and cost.
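To put rough numbers on the wait-state penalty noted above – assuming, purely for illustration, two Flash wait-states at 48 MHz (an illustrative figure, not a CC2340R5 datasheet value) – an uncached instruction fetch would cost three CPU cycles instead of one:

$$\frac{t_{\text{Flash fetch}}}{t_{\text{SRAM fetch}}} \;=\; \frac{1 + W}{1 + 0} \;=\; \frac{1 + 2}{1} \;=\; 3 \qquad (W = \text{wait-states})$$

which is where a "factor of 3" slowdown with the cache disabled would come from.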
With its innate ability to reduce executable code size, EM compels us to place the entire application program into on-chip SRAM – memory already present within our MCU. Said another way, we could explicitly manage SRAM as an application-specific cache.(1)
- suggestive of program overlays, for those of you old enough to remember
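To make "SRAM as an application-specific cache" concrete, a startup routine can copy the executable image from its Flash load address into SRAM before handing control to the application. The following is only a sketch – the boundary symbols (__text_load__, __text_start__, __text_end__) are made-up names standing in for whatever an actual linker script would publish:

```c
/* hypothetical startup fragment: assumes the linker assigns .text a
 * Flash load address and an SRAM run address, and publishes these
 * (made-up) boundary symbols; the routine itself must execute from
 * Flash (or a separate boot section) until the copy completes */
extern unsigned long __text_load__;   /* first word of code image in Flash */
extern unsigned long __text_start__;  /* run address of code in SRAM       */
extern unsigned long __text_end__;    /* end of code in SRAM               */

static void copyTextToSram(void) {
    const unsigned long *src = &__text_load__;
    unsigned long *dst = &__text_start__;
    while (dst < &__text_end__) {
        *dst++ = *src++;              /* move the image into zero wait-state SRAM */
    }
}
```

Whether this copy happens in a reset handler, a boot loader, or via linker-generated copy tables is a design choice; the essential point is that the entire 5792-byte EM•Mark image fits comfortably within the CC2340R5's 36K of SRAM.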
We've already reported a significant [60%] decrease in overall program size; let's now quantify whether our smaller EM•Mark program can in fact execute faster than legacy CoreMark.
The envelope, please ...
Using the setup described in EM•Mark Results, the following table summarizes the execution times of CoreMark versus EM•Mark in different memory configurations:
program | memory configuration | execution time
CoreMark | text + const [ Flash ] | 176 ms
EM•Mark | text + const [ Flash ] | 151 ms
EM•Mark | text + const [ SRAM ] | 124 ms
These figures represent the time (in milliseconds) required to run 10 iterations of the underlying benchmark algorithms, and do not include any time consumed when initializing the input data or when displaying the output results.
But doesn't legacy CoreMark calculate its score differently ???
Not really .... Since CoreMark relies on a millisecond-resolution on-chip timer to actually measure execution time, EEMBC requires that the program run for at least 10 seconds – implying 1000s of benchmark iterations. From here, we can calculate CM / MHz – where CM equals the number of benchmark iterations per second and MHz reflects the CPU clock frequency.
Using a Saleae Logic Analyzer, we verified that time-per-iteration remains constant – regardless of whether we perform 10s, 100s, or 1000s of benchmark iterations. Applying the three execution times reported earlier for the 48 MHz TI CC2340R5 [ 176 ms, 151 ms, 124 ms ], we arrive at normalized CM / MHz scores of [ 1.18, 1.38, 1.68 ].
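For anyone wanting to reproduce the arithmetic, each score simply divides iterations-per-second by the CPU clock in MHz; taking the SRAM configuration as an example:

$$\text{CM/MHz} \;=\; \frac{\text{iterations}/\text{elapsed time}}{f_{\text{CPU}}\,[\text{MHz}]} \;=\; \frac{10 / 0.124\ \text{s}}{48} \;\approx\; \frac{80.6}{48} \;\approx\; 1.68$$

The 176 ms and 151 ms measurements yield 1.18 and 1.38 in exactly the same way.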
TI in fact reports a CM / MHz score of 2.19 – but using a CoreMark program compiled with the most aggressive optimize-for-speed options. While nearly twice as fast, its executable code has also doubled in size – with the total program image now consuming more than 33K of memory !!!
Aware that we'll sacrifice some execution time, we compiled both our CoreMark and EM•Mark programs with the underlying compiler's most aggressive optimize-for-space options – following the practice of most embedded software developers.
More important than achieving the highest absolute score, however, is the 14% reduction in execution time between CoreMark [ 176 ms ] and EM•Mark [ 151 ms ] – with both programs executing from Flash with the on-chip cache enabled.(1)
- disabling the cache would probably degrade performance by at least a factor of 3
But then let's factor in the additional 18% of performance improvement that can accrue by (simply !!) executing EM•Mark entirely from SRAM – suggesting that future MCUs could eliminate the on-chip cache [ lower cost ] as well as increase the CPU clock speed [ more MIPS ] .
code size does matter :: tiny(er) code ⇒ tiny(er) chips
How can you get involved
consider our Tiny code → Tiny chips premise, in light of the SRAM results reported here
explore this Benchmark configuration material, to understand how we reduced program size
try running ActiveRunnerP with different compiler options – or even with different compilers
Happy coding !!!