KWS on MCU and NPU: Achieving Sub-10 ms Inference at Low Power

Why KWS Is the Benchmark Edge-AI Workload on MCUs

Keyword spotting (KWS) has to hit two targets at once: fast reaction and low energy.

Alif Semiconductor® sees KWS show up again and again because it’s the smallest always-on edge AI workload that still exposes the real bottlenecks: audio capture, MFCC preprocessing, model operator support, memory placement, and scheduling. This article focuses on those details: how to reach ≤10ms end-to-end latency on an MCU, and when adding an NPU pays off.

Why KWS stresses latency and power simultaneously

KWS is a benchmark because the workload is stable: constant audio input, predictable compute stages, and a repeatable timing loop. It also maps directly to product value. Wake words reduce buttons, screens, and always-connected dependencies.

  • Always-on audio means sampling continuously. Even if inference runs in bursts, the audio front-end and framing loop never stop.
  • Tight real-time constraints. KWS is judged by the time from the last audio slice needed for a decision, to an output trigger.

If compute time creeps too close to the stride, you start dropping frames, increasing latency, or both.

The KWS Pipeline on an MCU

A KWS loop is only as fast as its slowest stage. If you want ≤10ms end-to-end latency, you must budget and optimize each stage.
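As a rough budget (our own decomposition, not a figure from Alif’s whitepaper):

  t_end_to_end = t_buffer_wait + t_MFCC + t_inference + t_post ≤ 10ms

With the 3.35ms Ethos-U55 inference figure discussed below, that leaves roughly 6 to 7ms for buffering, the MFCC slice, and post-processing combined.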

Audio capture and framing

Most KWS pipelines follow this structure:

  • Capture: PDM/I²S mic → DMA → ring buffer
  • Frame: slice into overlapping windows
  • Convert (optional): stereo → mono, scale, DC removal

A key engineering decision is whether the Central Processing Unit (CPU) touches every sample.

Direct Memory Access (DMA) handling reduces CPU workload: the processor wakes only when a buffer-ready interrupt fires, which improves both latency and energy efficiency. In the Ensemble® family, DMA is treated as a core feature because it enables continuous data streaming (audio capture, mel-frequency cepstral coefficient (MFCC) computation, and feeding inference buffers) without the CPU manually copying each block of data. A practical implementation for a 16kHz system is a double-buffering (ping-pong) scheme: DMA fills one buffer while the CPU processes the other, and an interrupt swaps their roles at each stride interval.
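A minimal sketch of that ping-pong scheme, assuming a 16kHz input and a 20ms stride; dma_start(), process_frame(), and __WFI() are placeholders for your vendor’s DMA driver, your DSP stage, and the CMSIS wait-for-interrupt intrinsic, not a specific Alif API:

```cpp
// Ping-pong capture sketch: 16 kHz, 20 ms stride -> 320 samples/buffer.
#include <cstdint>

constexpr int kSampleRateHz = 16000;
constexpr int kStrideMs     = 20;
constexpr int kFrameSamples = kSampleRateHz * kStrideMs / 1000;  // 320

extern void dma_start(int16_t *dst, int samples);  // hypothetical driver call
extern void process_frame(const int16_t *pcm);     // MFCC + inference stages
extern void __WFI(void);                           // stands in for CMSIS WFI

static int16_t pcm_buf[2][kFrameSamples];
static volatile int fill_idx = 0;                    // buffer DMA is filling
static const int16_t *volatile ready_buf = nullptr;  // buffer ready for CPU

// Called from the DMA transfer-complete interrupt: swap buffer roles.
extern "C" void on_dma_done(void)
{
    ready_buf  = pcm_buf[fill_idx];               // hand full buffer to CPU
    fill_idx  ^= 1;
    dma_start(pcm_buf[fill_idx], kFrameSamples);  // refill the other buffer
}

// Main loop: the CPU sleeps until a buffer-ready interrupt fires.
void audio_task(void)
{
    dma_start(pcm_buf[fill_idx], kFrameSamples);
    for (;;) {
        while (ready_buf == nullptr) { __WFI(); }
        const int16_t *frame = ready_buf;
        ready_buf = nullptr;
        process_frame(frame);         // must finish within one stride
    }
}
```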

MFCC preprocessing cost

Mel-frequency cepstral coefficients (MFCCs) are a feature representation that captures the perceptually relevant spectral characteristics of audio. They matter for KWS because they compress speech into a compact form that highlights phonetic content while reducing noise and variability. MFCC computation can dominate the compute budget in KWS pipelines, often outweighing the cost of the model itself. In Arm® Cortex-M55 optimization work, we observed a pipeline where inference on the Arm® Ethos-U55 remained constant at 3.35ms, while digital signal preprocessing for MFCC generation ranged from 14.6 to 21.9ms depending on compiler toolchain and optimization flags. This highlights a key constraint.

If MFCCs are recomputed from scratch for every overlapping window, preprocessing alone can break real-time performance. Two optimizations help. First, incremental MFCC computes features only for the newest slice of audio in each stride, reusing previously computed frames. Second, overlap reuse exploits the data shared between adjacent windows: with 1-second windows that overlap by half, up to 50% of the MFCC frames can be carried over from the prior window, effectively halving digital signal processing (DSP) time.
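A hedged sketch of the overlap-reuse pattern; mfcc_compute_frame() and the frame counts are illustrative (49 frames per window, 25 replaced per stride, roughly the MicroNet KWS feature geometry), not a specific library API:

```cpp
// Overlap reuse: shift the frames shared with the previous window,
// then compute MFCCs only for the newly captured audio.
#include <cstdint>
#include <cstring>

constexpr int kNumCoeffs = 10;   // MFCC coefficients per frame
constexpr int kNumFrames = 49;   // frames covering the full window
constexpr int kNewFrames = 25;   // frames replaced by each new stride

static float features[kNumFrames][kNumCoeffs];

extern void mfcc_compute_frame(const int16_t *pcm, float *coeffs);

void update_features(const int16_t *new_pcm, int frame_step)
{
    // 1. Shift: keep the frames shared with the previous window.
    std::memmove(features[0], features[kNewFrames],
                 sizeof(float) * kNumCoeffs * (kNumFrames - kNewFrames));

    // 2. Compute MFCCs only for the new audio (about half the work).
    for (int f = 0; f < kNewFrames; ++f) {
        mfcc_compute_frame(&new_pcm[f * frame_step],
                           features[kNumFrames - kNewFrames + f]);
    }
}
```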

Moving Inference from CPU to NPU

A neural processing unit (NPU) changes your latency and energy profile only if your model is mostly made of operators that the NPU accelerates, and tensors live in memory that the NPU can access efficiently.

What parts of KWS benefit from an NPU

Most KWS networks are convolution-heavy (especially DS‑CNN style architectures). Those layers are the sweet spot for an NPU. In the Ensemble family, the AI island pairs an Arm Cortex-M55 control core with an Arm Ethos‑U55 microNPU.

For example, the E1 series integrates a single Cortex-M55 up to 160 MHz with one Ethos-U55 delivering up to 46 GOPS (128 MAC/cycle), along with dedicated DMA support for efficient weight streaming. This makes it well suited for always-on KWS, where the M55 handles continuous DSP tasks like MFCC extraction while the U55 accelerates convolution-heavy inference at low power.

The E3 series extends this concept into a dual-domain architecture, combining two Cortex-M55 cores (typically around 160MHz and 400MHz) with two Ethos-U55 NPUs, one configured at 128 MAC/cycle (46 GOPS) and another at 256 MAC/cycle (204 GOPS).

That configuration is valuable for KWS because you can keep a low-power always-on loop running, and only wake the high-performance domain when it’s needed.

Operator coverage and fallback paths

This is where many NPU deployments go wrong. A model may successfully compile, but parts of it silently execute on the CPU instead of the accelerator. With the Arm Ethos-U55, this behavior is expected – only a defined subset of operators runs on the NPU – and any unsupported operations automatically fall back to the Cortex-M CPU, often via CMSIS-NN or reference kernels. This means a model can appear accelerated while still spending a significant portion of its time on the CPU.

Alif’s end-to-end setups are explicit about this: they allow fallback but ensure models are pre-optimized with tools like Vela, which partitions the graph and maximizes NPU coverage. The key takeaway is to design for the NPU from the start, preferring standard convolution and depthwise convolution blocks that map cleanly to the accelerator. Be cautious of operations that trigger fallback, such as non-quantized layers, custom activations, unsupported reshape patterns, or unusual padding. Most importantly, treat fallback percentage as an important KPI, because every operator that drops back to the CPU directly increases latency and energy consumption.
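To make fallback visible rather than silent, it helps to look at how a Vela-compiled model is loaded. The sketch below uses TensorFlow Lite Micro and assumes a build with Ethos-U kernel support; the model symbol, arena size, and fallback operators are illustrative. After Vela, the NPU-mapped portion of the graph appears as a single ETHOSU custom operator, so every other Add call in the resolver is, by definition, an operator that runs on the CPU:

```cpp
// Sketch: loading a Vela-compiled KWS model in TensorFlow Lite Micro.
// Assumes a TFLM build with Ethos-U support; names are illustrative.
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char kws_model_vela_tflite[];   // Vela output

constexpr int kArenaSize = 64 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

void setup_interpreter()
{
    // One entry for the NPU subgraph + one per CPU-fallback operator.
    static tflite::MicroMutableOpResolver<3> resolver;
    resolver.AddEthosU();     // the accelerated portion of the graph
    resolver.AddReshape();    // examples of operators left on the CPU
    resolver.AddSoftmax();

    static tflite::MicroInterpreter interpreter(
        tflite::GetModel(kws_model_vela_tflite), resolver,
        tensor_arena, kArenaSize);
    interpreter.AllocateTensors();
}
```

Counting and profiling those CPU-resident operators is the practical way to track the fallback KPI.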

Achieving ≤10ms End-to-End Latency

A reliable KWS pipeline avoids blocking execution by overlapping stages across hardware. At a 20ms stride, a proven pattern is to let DMA capture the next audio frame while the CPU computes MFCCs for the previous buffer, the NPU runs inference on the prior feature tensor, and the CPU performs lightweight post-processing (such as smoothing and thresholding). This keeps all units active and maintains real-time performance.
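A minimal sketch of that overlapped schedule, with every function a hypothetical placeholder for your DMA driver, DSP code, and NPU runtime (buffer sizes assume a 16kHz input, 20ms stride, and a 49×10 feature tensor):

```cpp
// Overlapped 20 ms schedule: DMA, CPU, and NPU each work on a
// different frame within the same stride. All functions are placeholders.
#include <cstdint>

extern void dma_capture_async(int16_t *dst);      // fills dst in background
extern void dma_wait_done(void);                  // blocks until stride end
extern void npu_infer_async(const float *feat);   // kicks off the NPU job
extern void npu_wait(float *scores);              // joins the NPU job
extern void mfcc_compute(const int16_t *pcm, float *feat);
extern void postprocess(const float *scores);     // smoothing + threshold

void kws_pipeline(void)
{
    static int16_t pcm[2][320];     // ping-pong audio buffers
    static float   feat[2][490];    // ping-pong feature tensors (49x10)
    static float   scores[12];      // per-keyword outputs
    int cur = 0;

    dma_capture_async(pcm[cur]);    // prime the capture path
    for (;;) {
        // While DMA fills pcm[cur], work on older data:
        npu_infer_async(feat[cur ^ 1]);          // features, 2 strides ago
        mfcc_compute(pcm[cur ^ 1], feat[cur]);   // audio, 1 stride ago
        npu_wait(scores);
        postprocess(scores);                     // decision for that window
        dma_wait_done();                         // stride boundary
        cur ^= 1;
        dma_capture_async(pcm[cur]);             // next audio frame
    }
}
```

The first couple of iterations run on zero-initialized buffers; after that, every stride keeps DMA, CPU, and NPU busy on three different frames.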

In contrast, a sequential capture → MFCC → inference → post loop in a single thread may work in isolation but typically breaks under interrupt jitter or slight workload growth. A realistic approach, as seen in Alif’s ML Embedded Evaluation Kit, uses an always-on core to detect keywords and only wakes a higher-performance domain when needed, keeping latency low without running the full system continuously.

Clocking and memory placement

Sustaining this pipeline depends on efficient clocking and memory use. The always-on path should remain in a low-power domain with fast local memory, while higher-performance cores and memory regions are only activated on demand. DMA ensures data moves between stages without CPU overhead, so each stage receives data on time. Poor placement, such as unnecessary memory transfers or running everything in a high-performance domain, adds latency and power cost. The goal is to keep the steady-state loop lightweight and deterministic, scaling compute and bandwidth only when a keyword event requires it.

Power Breakdown: Where the Milliwatts Go

Power in KWS is governed by how often each stage runs and how long it stays active.

Audio front-end vs inference power

Two workloads dominate:

  • Always-on: mic + capture path + framing interrupts
  • Burst compute: MFCC + inference + post

If you run KWS every 500ms, small changes in MFCC runtime meaningfully change CPU utilization and therefore average power. In Alif’s whitepaper, changing toolchain settings moved CPU utilization between 3% and 4% at a 500ms schedule; that is simply duty cycle, since roughly 15ms of active work per 500ms stride is 3% utilization and roughly 20ms is 4%.

CPU vs NPU energy comparison

The metric to optimize is energy per inference, not peak power.

On Alif’s published benchmark table (MicroNet Medium KWS, int8, Speech Commands), the accelerated path on an Ensemble high-performance system shows 6.8ms inference time and 0.17mJ energy per inference, versus 137ms and 2.62mJ for CPU-only inference: roughly 20× faster and about 15× less energy per keyword decision.

That’s the realistic reason to use an NPU. It finishes quickly and returns the system to a low-power state sooner. The same table provides the model footprint context: 154KB MRAM and 28KB SRAM for the MicroNet KWS model.

Real Logs and What They Show

Fast claims without logs don’t help engineers. Logs are how you prove where time and energy are going.

Latency logs

A KWS latency log should show:

  • DMA buffer-ready timestamp
  • MFCC slice start/stop
  • NPU inference start/stop (and whether any fallback ran on CPU)
  • Post-processing end (decision time)
  • Wake/sleep transitions (so you see padding)

For precise stage timing, we use the Cortex‑M55 Performance Monitoring Unit (PMU) cycle counter, as sketched below.
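A minimal sketch using the CMSIS PMU API (pmu_armv8.h, available on Armv8.1-M cores such as the M55); SystemCoreClock and the stage callback are placeholders for your own code:

```cpp
// Time one pipeline stage with the Cortex-M55 PMU cycle counter.
#include <stdint.h>
#include "pmu_armv8.h"

extern uint32_t SystemCoreClock;

uint32_t time_stage_us(void (*stage)(void))
{
    ARM_PMU_Enable();
    ARM_PMU_CYCCNT_Reset();
    ARM_PMU_CNTR_Enable(PMU_CNTENSET_CCNTR_ENABLE_Msk);

    stage();                                    // e.g. the MFCC slice

    uint64_t cycles = ARM_PMU_Get_CCNTR();
    ARM_PMU_CNTR_Disable(PMU_CNTENCLR_CCNTR_ENABLE_Msk);

    return (uint32_t)(cycles / (SystemCoreClock / 1000000u));  // us
}
```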

For deeper debugging, Ensemble devices support CoreSight trace options including ETR-based trace into SRAM.

Power logs

Power logs need two views:

  • Event-level trace: wake spikes, MFCC burst, NPU burst, tail current.
  • Averaged current: battery-life projection over minutes/hours.

This is where hardware matters for validation. The Ensemble E8 DevKit is designed to expose signals so you can access pins for power and performance profiling without redesigning your own board on day one.

When KWS on MCU+NPU Is the Right Choice

Not every wake word needs an NPU. The right answer depends on model complexity, languages, and the rest of the workload.

When CPU-only is sufficient

CPU-only makes sense when:

  • Your model is tiny (single wake word, limited noise robustness)
  • You’re comfortable lowering sample rate/feature resolution
  • You can afford longer inference windows

Even then, MFCC can still dominate the time budget.

That’s why using a DSP-capable MCU core matters: Ensemble E1’s Cortex‑M55 includes Helium vector processing plus a double-precision FPU and zero-wait TCM for tight loops.

When NPU acceleration pays off

NPU acceleration pays off when:

  • You need multi-keyword or multi-language support
  • You want stronger noise robustness without ballooning latency
  • You want to run at lower clocks for energy, yet keep response time tight
  • KWS is the gate for larger workloads

This architecture uses always-on lightweight monitoring, waking heavier compute only when needed. This is exactly what MountAIn used in its solar-powered IBEX device: it runs a high-efficiency always-on block continuously and wakes the high-performance block for heavier neural inference on relevant samples.

The same pattern shows up in Alif’s voice-focused case material as well: for example, a hearing-aid case study highlights using the Helium-enabled M55 for audio preprocessing and the Ethos‑U55 for neural networks, keeping latency low while maintaining battery life targets.

Contact Alif sales to discuss your KWS latency and power targets and to map your pipeline (DMA framing, MFCC strategy, operator coverage, memory placement) onto the right Alif Ensemble device.
