Edge AI on microcontrollers (MCUs) has opened a new frontier for embedded intelligence, enabling ultra-low-power devices to recognize voice, gestures, vibration patterns, and more. This guide introduces the key criteria and design choices that will help you choose the right MCU for your next edge AI project.
The term edge AI encompasses a broad range of applications in which engineers must balance memory, power consumption, latency, footprint, toolchains, and more to achieve robust AI inference on battery-powered devices. Below is a table of evaluation criteria for MCUs in edge AI applications.
| Criterion | What to measure/check | Why it matters |
| --- | --- | --- |
| Performance | Effective MAC operations per second, sustained throughput, CPU vs DSP vs NPU | Higher performance leads to lower latency, higher accuracy, and often higher power |
| Memory | Usable SRAM for activations and stacks; NVM (Flash/MRAM) for weights and code | More memory allows larger, more capable models to be deployed |
| Power | Sleep power, active power, and energy per inference | Battery life is dominated by wake frequency and energy per active event |
| Latency | End-to-end sensor-to-decision latency, pre/post-processing, wake-up time | “Inference time” alone is misleading; the full pipeline (including initial wake-up time) also matters |
| I/O & interfaces | PDM/I²S, SPI/I²C/I3C, CSI, CPI, ADCs; DMA maturity | Flexible interfaces and abundant I/Os allow for growth and changes as revisions advance |
| Wireless | Bluetooth/Wi-Fi/Cellular/Satellite impact on power, OTA bandwidth, and app architecture | The RF stack often dominates energy budgets and influences how you update devices, receive and transmit data, and communicate with nodes |
| Security | Secure boot, root of trust, TrustZone-M, secure debug, key storage, model IP protection | AI models are IP; products need signed updates and rollback protection, not just setting programming bits |
| Software & tools | Support for helpful tools: TensorFlow Lite Micro (TFLM), CMSIS-NN, ExecuTorch, ONNX-to-TFLM conversion | Well-documented tools reduce development time, and support for popular frameworks allows plug-and-play prototyping |
| Thermal & form factor | Package size, industrial temperature range, board-level RF constraints | Choosing the right package for your application can simplify production and ensure size restrictions are met |
| Cost & supply | Lifecycle, availability, ecosystem maturity, certification needs | An MCU that can’t be sourced or supported is a project risk |
Sizing the MCU for Your Workload
When sizing an MCU for your application, an efficient hardware/software match (proper toolchains, quantization, and offloading) can yield enormous savings in both engineering hours and energy per inference. Focusing only on peak multiply-accumulate (MAC) rates or FLOPS can be misleading. In embedded AI applications, energy per inference and latency per inference are the two key metrics.
When selecting an MCU, comparing power consumption on a representative model provides a real-world benchmark that carries over across models and applications. Many vendors now report microjoule-level energy per inference (especially when using an NPU).
Worst-case latency (compute plus DMA overhead) can make or break an application. For example, voice wake-word detection or gesture recognition might tolerate up to 50 ms of inference time, but real-time vision tasks often need under 20 ms.
Multiple cores or NPUs can help achieve low latency targets: e.g. a dual-core Cortex-M with an NPU can reduce total delay by running computation in parallel. Remember to account for wake-up costs too (deep-sleep exit can cost tens of microseconds). Ensure that the MCU’s wake-up and pipeline-fill delays fit your real-time budget.
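As a sanity check, it helps to write the latency budget down explicitly. The sketch below sums per-stage timings against a deadline; the numbers are placeholder assumptions, not vendor figures, and would be replaced with values profiled on your target part.

```c
/* Minimal latency-budget sketch. The timings below are placeholder
 * assumptions to be replaced with values measured on your target MCU. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t wakeup_us;       /* deep-sleep exit + clock/PLL settle */
    uint32_t dma_capture_us;  /* sensor -> SRAM transfer */
    uint32_t preprocess_us;   /* e.g. MFCC / resize on the CPU or DSP */
    uint32_t inference_us;    /* NPU or CPU inference time */
    uint32_t postprocess_us;  /* thresholding, debouncing, action */
} pipeline_budget_t;

static bool fits_deadline(const pipeline_budget_t *p, uint32_t deadline_us)
{
    uint32_t total = p->wakeup_us + p->dma_capture_us + p->preprocess_us +
                     p->inference_us + p->postprocess_us;
    printf("end-to-end latency: %lu us (deadline %lu us)\n",
           (unsigned long)total, (unsigned long)deadline_us);
    return total <= deadline_us;
}

int main(void)
{
    /* Hypothetical wake-word pipeline checked against a 50 ms budget. */
    pipeline_budget_t kws = { .wakeup_us = 50, .dma_capture_us = 500,
                              .preprocess_us = 3000, .inference_us = 5000,
                              .postprocess_us = 200 };
    return fits_deadline(&kws, 50000) ? 0 : 1;
}
```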
An important detail to consider is how much of the computation is offloaded to accelerators. If most MAC operations run on the NPU, the MCU’s Cortex-M core can remain idle (using negligible power). Verify that data transfers (sensor → memory → NPU) can happen via DMA to offload the CPU. Check interrupt handling and cache-coherency behavior too; a poorly handled interrupt during inference can stall the pipeline and waste cycles.
For edge AI workloads that involve short inference followed by long sleep, peak operations-per-second (usually in the GOPS or TOPS range) matters less than average efficiency. When evaluating devices, consider both the peak compute (for worst-case latency) and the average current draw at your intended duty cycle.
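To see why average efficiency dominates, a quick back-of-envelope calculation helps. The sketch below converts an active burst plus deep sleep into energy per event and average power; all currents and timings are assumptions for illustration, not measured figures.

```c
/* Duty-cycle power sketch: average power is set by how often you wake and
 * how much energy each active event costs, not by peak GOPS.
 * All values are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double active_power_mw = 15.0;  /* assumed power during wake + inference */
    const double sleep_power_uw  = 5.0;   /* assumed deep-sleep power */
    const double active_time_ms  = 6.0;   /* assumed active burst length */
    const double period_s        = 2.0;   /* one inference every 2 seconds */

    double energy_per_event_uj = active_power_mw * active_time_ms;  /* mW * ms = uJ */
    double avg_power_uw = energy_per_event_uj / period_s            /* active share */
                        + sleep_power_uw * (1.0 - (active_time_ms / 1000.0) / period_s);

    printf("energy per active event: %.1f uJ\n", energy_per_event_uj);
    printf("average power: %.1f uW\n", avg_power_uw);
    return 0;
}
```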
Compute Choices: CPU vs DSP vs NPU
Different always-on tasks favor different compute engines. For ultra-low-power audio or IMU tasks (keyword spotting, anomaly detection), a single Arm® Cortex™-M55 core (with Arm® Helium DSP extensions) suffices for smaller neural nets. These tasks typically benefit from continuous DSP filtering followed by infrequent neural-net invocations.
Larger models (such as convolutional networks for vision) often require a dedicated accelerator to achieve low latency inference. An embedded NPU or a dual-core MCU can deliver much higher throughput per watt for heavy workloads. In such designs, the workload is often split: the MCU core handles low-speed I/O and control logic, while an NPU performs the bulk of model inference. Alif’s Ensemble® MCUs showcase this approach: they have two Arm® Cortex™-M55 subsystems (high-efficiency and high-performance) each with its own Arm® Ethos™-U55 NPU.

In practice, it helps to map use cases to compute architecture. Alif Semiconductor® integrates on-chip Arm® Helium DSP/vector acceleration on the Cortex-M55, with an integrated NPU available for AI inference offload.
- DSP/vector + NPU-capable (Arm® Helium-first, NPU when needed): For mid-size models (e.g. larger keyword nets and compact vision convnets), leverage the Cortex-M55 with Arm® Helium vector processing for efficient DSP and pre/post-processing, and use the integrated Arm® Ethos™-U55 where it meaningfully reduces active time.
- Hybrid (CPU+NPU): For heavier workloads (large CNNs, multi-sensor fusion, or simultaneous tasks), the CPU handles control flow and preprocessing while the Arm® Ethos™-U55 executes MAC-heavy layers, reducing latency and energy by finishing quickly and returning to low-power states.
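As an illustration of this hybrid split, the sketch below runs the DSP front end on the Cortex-M core with CMSIS-DSP and then hands the feature buffer to the NPU. `npu_invoke()` and `extract_features()` are hypothetical wrappers standing in for your model runtime (e.g. a TFLM interpreter delegating to the Ethos-U driver) and feature extractor; they are not a vendor API.

```c
/* Hybrid CPU+NPU split sketch: DSP front end on the CPU (Helium via
 * CMSIS-DSP), MAC-heavy layers on the NPU via a hypothetical wrapper. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "arm_math.h"   /* CMSIS-DSP */

#define FFT_LEN      256
#define NUM_FEATURES 64

/* Hypothetical hooks, not a real vendor API. */
extern void extract_features(const float32_t *spectrum, int8_t *features);
extern int  npu_invoke(const int8_t *features, size_t len, int8_t *scores);

static arm_rfft_fast_instance_f32 rfft;

void pipeline_init(void)
{
    arm_rfft_fast_init_f32(&rfft, FFT_LEN);
}

void audio_frame_ready(const float32_t *frame /* FFT_LEN samples */)
{
    float32_t work[FFT_LEN];
    float32_t spectrum[FFT_LEN];
    int8_t features[NUM_FEATURES];
    int8_t scores[4];

    memcpy(work, frame, sizeof(work));            /* rfft modifies its input buffer */
    arm_rfft_fast_f32(&rfft, work, spectrum, 0);  /* DSP on the CPU (Helium) */
    extract_features(spectrum, features);          /* e.g. log-mel binning */
    npu_invoke(features, sizeof(features), scores);/* MAC-heavy layers on the NPU */
}
```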
Power Budgeting and Estimating Battery Life
Raw benchmark numbers (e.g., peak TOPS) are often misleading for battery-powered use cases; focus on inference latency, wake-up time, and energy-per-inference. For instance, consider a periodic wake-word detector: if the MCU takes 10 ms to wake and 5 ms to infer, it may consume dozens of μJ each cycle despite a high MAC rate. Measure end-to-end latency (sensor → preprocessing → inference → action) and compute energy-per-inference under real code and data conditions, not just MACs/s.
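One practical way to get real numbers is to time the inference on target with the Cortex-M DWT cycle counter and convert the result to energy using a bench-measured current. A minimal sketch, assuming CMSIS register definitions from your device header and placeholder clock/supply/current figures:

```c
/* On-target profiling sketch using the Cortex-M DWT cycle counter (CMSIS
 * register names). Clock, supply voltage, and current are assumptions to be
 * replaced with your part's clock setting and a bench measurement. */
#include <stdint.h>
#include <stdio.h>
#include "device.h"   /* placeholder for your CMSIS device header (defines DWT/CoreDebug) */

#define CPU_HZ    160000000u   /* assumed core clock */
#define SUPPLY_V  1.8          /* assumed core supply */
#define ACTIVE_A  0.004        /* assumed average active current (4 mA) */

extern void run_inference(void);   /* your model invocation */

void profile_inference(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace unit */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start cycle counter */

    run_inference();

    uint32_t cycles   = DWT->CYCCNT;
    double   seconds  = (double)cycles / CPU_HZ;
    double   energy_uj = SUPPLY_V * ACTIVE_A * seconds * 1e6;

    printf("inference: %lu cycles, %.2f ms, ~%.1f uJ\n",
           (unsigned long)cycles, seconds * 1e3, energy_uj);
}
```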
Duty cycling greatly improves battery life. For example, doing a millisecond of work followed by seconds of deep sleep can yield orders of magnitude longer life than running continuously. In one study, local on-device CNN inference reduced transmission energy by a factor of 5 by sending only an 8-bit class indicator instead of a 224×224 image. In that scenario total power (including radio) was far lower than streaming raw data.
Key techniques include:
- Duty-cycling: Burst computation then sleep. This greatly extends battery life.
- DMA & Concurrency: Offload data movement via DMA so the CPU can sleep. Use multiple cores (or an NPU) so one processor can compute while another sleeps. Concurrency boosts throughput and lets the system return to low-power states sooner, lowering total energy.
- Full-System Profiling: Always measure on real hardware. Run your model on an evaluation board and profile the entire pipeline (including sensors, DMA, interrupts, radios).
Synthetic benchmarks often miss hidden power drains that appear in real workloads. By carefully budgeting power, you ensure that your MCU design can meet battery-life targets.
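Putting these techniques together, a simple budget model shows how sleep current, inference bursts, and radio events combine into battery life. The figures below are assumptions for illustration and would be replaced with profiled data from your hardware.

```c
/* Full-system battery-life sketch: sleep current, duty-cycled inference, and
 * occasional radio events combined into days of life from a small cell.
 * All currents, durations, and rates are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double batt_mah = 220.0;   /* assumed coin-cell-class capacity */
    const double sleep_ua = 2.0;     /* deep sleep, sensors biased off */
    const double infer_ma = 4.0, infer_ms = 6.0,  infer_per_hr = 1800.0;
    const double radio_ma = 6.0, radio_ms = 15.0, radio_per_hr = 60.0;

    /* Average current contribution of each activity, in uA. */
    double q_sleep = sleep_ua;  /* approximately always on */
    double q_infer = infer_ma * 1000.0 * (infer_ms / 3.6e6) * infer_per_hr;
    double q_radio = radio_ma * 1000.0 * (radio_ms / 3.6e6) * radio_per_hr;

    double avg_ua = q_sleep + q_infer + q_radio;
    double days   = (batt_mah * 1000.0) / avg_ua / 24.0;

    printf("average current: %.1f uA -> ~%.0f days\n", avg_ua, days);
    return 0;
}
```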
Memory Planning for Edge AI Models
Small MCUs have very limited memory, so planning is critical. Model weights are typically stored in Flash (or MRAM), whereas SRAM (often split into multiple banks or TCM blocks) holds activations, intermediate buffers, and stacks. As a rule of thumb, ensure SRAM exceeds the model’s peak activation footprint, or you may need to “tile” layers across multiple passes with DMA. Use quantized int8 models whenever possible to shrink both Flash and RAM usage without significant loss of accuracy.
Estimate your model’s quantized size and compare it to on-chip Flash. If it’s too large, either compress (e.g. prune) the model or plan for external memory (e.g. QSPI Flash). Some MCUs (including the Ensemble® and Balletto® MCUs from Alif Semiconductor) use MRAM, which allows execute-in-place (XIP) operation and finer-grained updates for model storage. Determine your network’s peak activation footprint using your framework’s tooling; this buffer must fit in fast SRAM or tightly coupled memory. DSP or NPU accelerators often have small on-chip caches, so large intermediate buffers can bottleneck performance.
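As a lightweight guard, the Flash and SRAM budget can be encoded as compile-time checks so a model that outgrows the part fails the build rather than the field. A sketch with illustrative sizes (replace them with your part’s memory map and your toolchain’s footprint report):

```c
/* Compile-time memory budget sketch. All sizes are illustrative assumptions;
 * substitute the NVM/SRAM of the part under evaluation and the footprint
 * numbers reported by your model converter and linker. */
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define NVM_BYTES            (1536u * 1024u)   /* assumed on-chip MRAM/Flash */
#define SRAM_BYTES           (1024u * 1024u)   /* assumed usable SRAM */

#define MODEL_WEIGHTS_BYTES  (900u * 1024u)    /* int8 model, from converter report */
#define APP_CODE_BYTES       (300u * 1024u)
#define TENSOR_ARENA_BYTES   (512u * 1024u)    /* peak activations + scratch */
#define STACKS_HEAP_BYTES    (64u  * 1024u)

static_assert(MODEL_WEIGHTS_BYTES + APP_CODE_BYTES <= NVM_BYTES,
              "model + code do not fit in on-chip NVM; prune, quantize, or add external Flash");
static_assert(TENSOR_ARENA_BYTES + STACKS_HEAP_BYTES <= SRAM_BYTES,
              "activation arena + stacks exceed SRAM; consider tiling or a smaller model");

/* A common pattern is to reserve the arena as one statically allocated block. */
static uint8_t tensor_arena[TENSOR_ARENA_BYTES] __attribute__((aligned(16)));
```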
For streamed data (audio frames, sensor bursts), DMA can preload the next input or weights while the CPU/NPU works. This reduces latency but requires enough SRAM to hold two sets of data; Alif Semiconductor MCUs provide multiple DMA engines for this purpose. A bigger model may exceed SRAM, while a smaller one may fit but need more inferences to reach the same confidence. Balance your accuracy needs against the memory profile.
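A minimal ping-pong (double-buffer) pattern is sketched below: the DMA fills one frame while the CPU/NPU processes the other. The `dma_start_capture()` hook and its completion interrupt are hypothetical, standing in for whichever DMA driver your platform provides.

```c
/* Ping-pong (double-buffer) sketch for streamed audio. dma_start_capture()
 * is a hypothetical HAL hook, not a specific vendor API. */
#include <stdint.h>

#define FRAME_SAMPLES 512

static int16_t frames[2][FRAME_SAMPLES];
static volatile uint8_t dma_idx = 0;        /* buffer currently being filled by DMA */
static volatile uint8_t frame_pending = 0;

extern void dma_start_capture(int16_t *dst, uint32_t samples);       /* hypothetical */
extern void process_frame(const int16_t *frame, uint32_t samples);   /* your pipeline */

/* Called from the DMA transfer-complete interrupt. */
void dma_done_isr(void)
{
    dma_idx ^= 1u;                                   /* switch to the other buffer */
    dma_start_capture(frames[dma_idx], FRAME_SAMPLES);
    frame_pending = 1;
}

void audio_loop(void)
{
    dma_start_capture(frames[dma_idx], FRAME_SAMPLES);
    for (;;) {
        while (!frame_pending) {
            __asm volatile ("wfi");                  /* sleep until the DMA IRQ fires */
        }
        frame_pending = 0;
        process_frame(frames[dma_idx ^ 1u], FRAME_SAMPLES);  /* the buffer DMA just finished */
    }
}
```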
Remember to take advantage of features such as XIP from external Flash for code (and even models) to minimize initial load time. Many TinyML toolchains (TensorFlow Lite Micro, CMSIS-NN) report the model’s SRAM and Flash usage during compilation. Use these estimates to guide MCU selection.
Security, Updates, and Lifecycle Management
Security in MCUs running AI inference involves protecting both firmware and model data. Alif Semiconductor’s Ensemble devices anchor security in a dedicated Secure Enclave that’s isolated from the application fabric. It is responsible for early system configuration and securely booting the application subsystems from an immutable root of trust, with secure key services and controlled debug/readout. Treat both firmware and ML models as signed assets and design your update flow so the Secure Enclave verifies authenticity before boot; pair this with rollback resistance and model IP protection as part of Alif’s lifecycle approach.
Privacy improves when inference stays on-device: transmit only events, summaries, or health metrics rather than raw audio or images. Plan secure over-the-air (OTA) updates for both code and models: sign the packages, encrypt communications, and ensure the MCU can authenticate and safely apply updates without losing functionality.
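The acceptance logic for such an update can stay small. The sketch below checks a signature and a monotonic version counter before staging the image; the `se_*` functions are hypothetical stand-ins for the platform’s secure-enclave/crypto services, not a documented Alif API.

```c
/* OTA acceptance sketch: authenticate, check for rollback, stage, reboot.
 * The se_* and flash_* functions are hypothetical platform hooks. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t version;        /* monotonic version for rollback protection */
    const uint8_t *payload;  /* signed (and typically encrypted) image */
    size_t len;
    const uint8_t *signature;
} ota_package_t;

extern bool     se_verify_signature(const uint8_t *data, size_t len,
                                    const uint8_t *sig);        /* hypothetical */
extern uint32_t se_get_committed_version(void);                  /* hypothetical */
extern bool     flash_write_staging_slot(const uint8_t *data, size_t len);
extern void     request_reboot_into_new_image(void);

bool ota_apply(const ota_package_t *pkg)
{
    if (!se_verify_signature(pkg->payload, pkg->len, pkg->signature)) {
        return false;                     /* reject unauthenticated images */
    }
    if (pkg->version <= se_get_committed_version()) {
        return false;                     /* reject rollbacks to older images */
    }
    if (!flash_write_staging_slot(pkg->payload, pkg->len)) {
        return false;                     /* keep running the current image */
    }
    request_reboot_into_new_image();      /* boot ROM / Secure Enclave re-verifies at boot */
    return true;
}
```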
Connectivity & Deployment Patterns
After inference, many devices need to send results or receive updates. Choosing the correct wireless communication interface for your application can dramatically improve battery life.
In general, Bluetooth Low Energy (BLE) is far more power-efficient than Wi‑Fi or cellular. Many BLE 5.x radios consume microwatts in deep sleep states and require only milliseconds to transmit a few bytes. BLE is simple for point-to-point telemetry and supports standard update services (DFU).
Wi-Fi offers high throughput at the cost of higher power consumption and longer connection times. Cellular IoT (LTE-M/NB-IoT) covers long ranges but incurs longer wake-up latency (network attach) and moderate power. For sensors that send occasional small packets or firmware deltas to a central node (often running a Wi-Fi, LTE, or satellite communication stack), BLE often suffices: it offers low sleep current, sufficient range (with the LE Coded PHY for long range), and a reliable low-power link for periodic telemetry and OTA updates.
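A rough energy-per-event comparison makes the difference concrete. The sketch below uses ballpark assumed currents and connection times for illustration only; they are not measured radio specifications.

```c
/* Back-of-envelope radio energy sketch: energy per reporting event for a
 * small BLE payload vs. a Wi-Fi connection cycle. All figures are rough
 * assumptions for illustration. */
#include <stdio.h>

static double event_energy_mj(double avg_ma, double duration_ms, double volts)
{
    return avg_ma * volts * duration_ms / 1000.0;   /* mA * V * ms / 1000 = mJ */
}

int main(void)
{
    /* Assumed: a BLE connection event carrying a few bytes. */
    double ble_mj  = event_energy_mj(6.0, 4.0, 3.0);
    /* Assumed: Wi-Fi wake, associate/DHCP, and a short secure exchange. */
    double wifi_mj = event_energy_mj(120.0, 700.0, 3.3);

    printf("BLE event:   %.2f mJ\n", ble_mj);
    printf("Wi-Fi event: %.1f mJ\n", wifi_mj);
    return 0;
}
```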
Toolchains and Workflow (From Model to Production)
The MCU ecosystem you choose must support your framework. For example, the Arm® Ethos™ NPUs in Alif Semiconductor MCUs now support ExecuTorch (PyTorch's on-device inference runtime), which compiles models for resource-constrained systems.
To achieve the best efficiency, the model and software stack deployed on the MCU can be adapted in several ways:
- Memory and energy are dominated by data movement, so shrinking a model’s footprint improves both. Techniques include reducing weight precision (quantization) and pruning redundant weights. For instance, using 8-bit integers instead of 32-bit floats cuts model size by roughly 4× and speeds up MACs on NPUs or SIMD units (see the quantization sketch after this list). Many tools support quantization-aware training or post-training quantization with minimal accuracy loss.
- Use operations (FFT, MFCC, etc.) and window sizes that fit your MCU’s energy budget. For example, using 16-bit instead of 32-bit activations halves memory usage. Tailor the ML pipeline so that heavy DSP operations (like FFT or audio front ends) use efficient libraries, and the final neural network is as small as possible.
- Explore Alif Semiconductor’s Conductor and Ensemble Developer Studio toolchains to assist in model deployment onto hardware. With ExecuTorch support, you can now bring PyTorch models to Alif Semiconductor MCUs without a separate conversion step, and the Arm® Ethos™ NPUs in Alif Semiconductor MCUs natively accept TOSA (Tensor Operator Set Architecture), easing integration.
- After initial deployment, profile the model on the actual hardware. Measure inference time, peak memory, and current draw. If performance is too slow or the device is too power-hungry, return to compression or try a slightly smaller model. This tuning loop (prune/quantize → deploy → measure) is standard in TinyML development to meet embedded constraints.
- Leverage available resources. The Alif Semiconductor Ensemble DevKit lets you try real AI workloads on actual hardware early in the design process. Alif’s sales team can recommend a specific family member or kit based on your target models.
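Referring back to the quantization point above, the sketch below shows the int8 affine quantization arithmetic (q = round(x/scale) + zero_point) that post-training quantization produces. The scale and zero point here are illustrative; real deployments take them from the converter rather than choosing them by hand.

```c
/* Minimal int8 affine quantization sketch: q = round(x / scale) + zero_point,
 * clamped to [-128, 127]. Scale and zero point are illustrative values. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

static int8_t quantize(float x, float scale, int32_t zero_point)
{
    int32_t q = (int32_t)lrintf(x / scale) + zero_point;
    if (q > 127)  q = 127;    /* clamp to int8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}

static float dequantize(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)(q - zero_point);
}

int main(void)
{
    const float   scale = 0.05f;   /* illustrative per-tensor scale */
    const int32_t zp    = -3;      /* illustrative zero point */
    float weights[] = { 0.42f, -1.75f, 0.0f, 3.1f };

    for (unsigned i = 0; i < sizeof(weights) / sizeof(weights[0]); i++) {
        int8_t q = quantize(weights[i], scale, zp);
        printf("%+.3f -> %4d -> %+.3f (1 byte instead of 4)\n",
               weights[i], q, dequantize(q, scale, zp));
    }
    return 0;
}
```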