The AI MCU: Benchmarking data shows Alif’s superior MCU architecture provides better AI performance than conventional MCUs offer
Introduction
Design engineers who work with microcontrollers tend to find that they quickly run up against the barriers of performance and power when they try to add AI functionality to a typical embedded control system.
The levers to pull when facing performance constraints in a conventional control application are normally CPU frequency and memory: increasing the CPU core’s speed and addressing more memory should produce a predictable increase in throughput and a reduction in latency.
Ethos-U55 in Ensemble AI MCU Reduces Power Consumption
For common AI applications such as speech recognition or object detection, however, these same levers do not seem to work. Faced with long wait times for a conventional 32-bit MCU to perform inferencing at the edge, design engineers have tried upgrading their hardware – for instance, by migrating from an Arm® Cortex®-M0+ core to a Cortex-M4. In AI applications, however, the effect tends to be negligible.
There is a technical explanation for this. AI inferencing consists largely of operations that can be executed in parallel, whereas a CPU excels at executing instructions sequentially. So running a machine learning application at higher speed on a CPU is like spinning the wheels of a car: it uses more energy, but forward progress is slow.
This argument underlies the hybrid architecture of the Alif Semiconductor Ensemble™ family. Like a conventional MCU, the Ensemble AI MCU includes a CPU, the Cortex-M55, for the control part of the application. But the Ensemble AI MCUs also have a dedicated hardware accelerator for AI functions, the Arm Ethos™-U55 micro neural processing unit (microNPU). The theory is that an architecture combining a dedicated AI accelerator with a CPU will perform better at machine learning tasks than an architecture based on a CPU alone.
But is the theory borne out in practice?
The answer is in the numbers. In a demonstration at Arm DevSummit 2021, an Ensemble E3 AI MCU (with single Cortex-M55 and Ethos-U55 cores) performed single-inference image classification using the MobileNet V2 1.0 model 78 times faster with the microNPU enabled than when the same MCU used only its Cortex-M55 CPU core. Execution time with the microNPU enabled was 8ms, compared with 624ms on the Cortex-M55 core alone. Bear in mind that the Cortex-M55 is several times faster than earlier-generation Cortex-M cores for these types of workloads, so an older core such as the Cortex-M4 or Cortex-M33 would take even longer. The faster execution speed also has a major impact on the energy consumed for each inference. When accelerated by the microNPU, the inference operation consumed only 3mJ, 76 times less than the 228mJ used when the same operation was executed on the CPU core.
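The arithmetic behind these figures can be checked with a short sketch. The Python below uses only the four numbers quoted above (8ms and 624ms, 3mJ and 228mJ). The function names are illustrative, and the average power is a derived value (energy divided by time), not a figure stated in the demonstration.

```python
def ratio(baseline: float, accelerated: float) -> float:
    """Return how many times smaller the accelerated figure is than the baseline."""
    return baseline / accelerated


def average_power_mw(energy_mj: float, time_ms: float) -> float:
    """Average power in mW. mJ divided by ms gives W, hence the factor of 1000."""
    return energy_mj / time_ms * 1000.0


# Figures quoted for the MobileNet V2 1.0 demonstration on the Ensemble E3.
cpu_time_ms, npu_time_ms = 624.0, 8.0
cpu_energy_mj, npu_energy_mj = 228.0, 3.0

speedup = ratio(cpu_time_ms, npu_time_ms)            # time improvement
energy_gain = ratio(cpu_energy_mj, npu_energy_mj)    # energy improvement

# Derived (not quoted) average power during the inference, for each case.
npu_power_mw = average_power_mw(npu_energy_mj, npu_time_ms)
cpu_power_mw = average_power_mw(cpu_energy_mj, cpu_time_ms)

print(f"speedup: {speedup:.0f}x, energy improvement: {energy_gain:.0f}x")
print(f"average power: {npu_power_mw:.0f} mW accelerated, {cpu_power_mw:.0f} mW CPU-only")
```

Under these figures the derived average power is similar in both cases, which is consistent with the point made above: the energy saving per inference comes almost entirely from the shorter execution time.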
In fact, across a broad range of common machine learning functions, the addition of the Ethos-U55 core shortens inference time by 20x to 163x and improves energy efficiency by 18x to 138x, as the tables show. Note that the benchmark baseline is the Cortex-M55, the Cortex-M CPU that Arm has adapted most extensively for machine learning operations: the contrast would be even starker against other Cortex-M cores.
The extra performance makes it possible to run applications far more effectively than on a CPU-only MCU. For instance, Alif Semiconductor has demonstrated a facial recognition and tracking application, hosted on the Ensemble E3, running at 148 inferences per second, substantially faster than is possible using conventional MCUs.
High Efficiency (HE) System: Cortex-M55 and Ethos-U55 128MAC at 160MHz
Model | Accelerated: Time (ms) | Accelerated: Power (mW, ∆) | Accelerated: Current (mA, ∆) | Accelerated: Energy (mJ, ∆) | CPU-bound on Cortex-M55: Time (ms) | CPU-bound on Cortex-M55: Power (mW, ∆) | CPU-bound on Cortex-M55: Current (mA, ∆) | CPU-bound on Cortex-M55: Energy (mJ, ∆) | Improvement: Time | Improvement: Energy efficiency |
KWS [1]: MicroNet Medium (ARM) | 15.9 | 8.8 | 2.6 | 0.14 | 326 | 3.4 | 1.0 | 1.27 | 21x | 19x |
Object Detection [2]: YOLO-Fastest (face trained) | 18.6 | 14.2 | 4.2 | 0.27 | 1373 | 5.4 | 1.6 | 8.3 | 74x | 67x |
Auto Speech Recognition [4]: Tiny ASR (Wav2letter) | 78.6 | 10.0 | 3.0 | 0.69 | 8562 | 7.4 | 2.2 | 62.5 | 109x | 104x |
High Performance (HP) System: Cortex-M55 and Ethos-U55 256MAC at 400MHz
Model | Accelerated: Time (ms) | Accelerated: Power (mW, ∆) | Accelerated: Current (mA, ∆) | Accelerated: Energy (mJ, ∆) | CPU-bound on Cortex-M55: Time (ms) | CPU-bound on Cortex-M55: Power (mW, ∆) | CPU-bound on Cortex-M55: Current (mA, ∆) | CPU-bound on Cortex-M55: Energy (mJ, ∆) | Improvement: Time | Improvement: Energy efficiency |
KWS [1]: MicroNet Medium (ARM) | 6.8 | 25.7 | 7.6 | 0.17 | 137 | 19.3 | 5.7 | 2.62 | 20x | 18x |
Object Detection [2]: YOLO-Fastest (face trained) | 7.3 | 33.8 | 10.0 | 0.25 | 657 | 21.3 | 6.3 | 13.7 | 90x | 76x |
Image Classification [3]: MobileNet v2 | 20.1 | 43.3 | 12.8 | 0.86 | 2707 | 24.0 | 7.1 | 62.4 | 135x | 108x |
Auto Speech Recognition [4]: Tiny ASR (Wav2letter) | 27.9 | 29.6 | 8.9 | 0.89 | 4534 | 17.3 | 5.2 | 77.8 | 163x | 138x |
2. Object Detection: 192×192 grayscale resolution & color. Quantized int8, trained on ‘WIDER FACE’ dataset. Model footprint: 431KB MRAM, 433KB SRAM
3. Image Classification: 224×224 24bit resolution & color. Quantized int8, trained on ‘ImageNet’ dataset. Model footprint: 3,552KB MRAM, 1,47KB SRAM
4. ASR: Tiny Wav2letter Pruned slotted into ARM’s ML demo app, running the ASR use case. Model footprint: 2,346.06KB MRAM, 1,197.20KB SRAM
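The time improvement multiples in the tables can be reproduced from the two Time columns. The following is a minimal sketch in Python with values copied from the tables. The row layout and function names are illustrative assumptions, and the sketch covers only the time ratio: the text does not state how the energy-efficiency multiples were derived, so they are not recomputed here.

```python
# Each row: (system, model, accelerated time ms, CPU-bound time ms, published multiple)
ROWS = [
    ("HE", "KWS", 15.9, 326.0, 21),
    ("HE", "Object Detection", 18.6, 1373.0, 74),
    ("HE", "Auto Speech Recognition", 78.6, 8562.0, 109),
    ("HP", "KWS", 6.8, 137.0, 20),
    ("HP", "Object Detection", 7.3, 657.0, 90),
    ("HP", "Image Classification", 20.1, 2707.0, 135),
    ("HP", "Auto Speech Recognition", 27.9, 4534.0, 163),
]


def time_speedup(accel_ms: float, cpu_ms: float) -> float:
    """Ratio of CPU-bound inference time to accelerated inference time."""
    return cpu_ms / accel_ms


def check_rows(rows):
    """Return (system, model, computed, published, matches) for every row."""
    results = []
    for system, model, accel_ms, cpu_ms, published in rows:
        computed = time_speedup(accel_ms, cpu_ms)
        # The tables report whole-number multiples, so compare after rounding.
        results.append((system, model, computed, published, round(computed) == published))
    return results


for system, model, computed, published, matches in check_rows(ROWS):
    status = "matches" if matches else "differs from"
    print(f"{system} {model}: {computed:.1f}x computed, {status} published {published}x")
```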
Conclusion
The evidence of the numbers is clear: an architecture optimized for machine learning operations, such as the Ensemble AI MCU, produces markedly superior machine learning performance. So microcontroller users now have a better lever to pull to solve problems with slow or unreliable AI performance at the edge.