It’s fair to say that the NPU is likely to be the center of attention in an integrated AI/ML MCU. In an earlier blog, we described how fast, low-power AI/ML inferencing is the result of a combination of an NPU and CPU working hand-in-glove.
But that’s not all.
For the CPU and NPU to work at full speed, they need sufficient fast memory capacity and a fast pipeline between them. This is another area in which the Ensemble and Balletto MCUs set themselves apart from MCUs that are not optimized for edge AI.
Looking at the representation of the Ensemble MCU memory topology (Fig. 2), you can see a memory system made up of two parts:
- The upper half represents the real-time section, consisting of very fast Tightly Coupled Memory (TCM) connected to the CPU and NPU cores. For fast inference times, these TCM SRAM memories must be sufficiently large to hold the ML model’s tensor arena.
- The lower half shows the other system memories connected by a common high-speed bus. A large, shared bulk SRAM is required to hold sensor data, such as the input from a camera and microphones. A large non-volatile memory contains the ML model itself plus the application code (see the placement sketch after this list).
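To make the split concrete, here is a minimal placement sketch in C. It assumes a GCC-style toolchain and hypothetical linker section names (".tcm_bss", ".nvm_model", ".bulk_sram"); the real section names, sizes, and start-up code depend on the device's linker script and SDK, so treat this as an illustration of the topology rather than board-ready code.

```c
/* Illustrative only: section names and sizes are assumptions, not part of
 * any specific SDK. They show where each kind of AI/ML data would live in
 * the two-part memory system described above. */

#include <stdint.h>

/* Tensor arena: the scratch memory the inference runtime works in.
 * Placed in fast TCM SRAM so the CPU and NPU are not stalled by bus traffic. */
#define TENSOR_ARENA_SIZE (512 * 1024)              /* example size only */
static uint8_t tensor_arena[TENSOR_ARENA_SIZE]
    __attribute__((section(".tcm_bss"), aligned(16)));

/* ML model (weights and graph): read-only, so it lives in the large
 * non-volatile memory alongside the application code. The bytes would
 * normally be generated from the trained network by a conversion tool;
 * a placeholder keeps this sketch compilable. */
static const uint8_t ml_model_data[]
    __attribute__((section(".nvm_model"))) = { 0x00 };

/* Sensor frames (e.g. camera or microphone input) are large and shared,
 * so they go in the bulk SRAM hanging off the common high-speed bus. */
#define FRAME_BUF_SIZE (320 * 240 * 2)              /* example QVGA RGB565 frame */
static uint8_t frame_buffer[FRAME_BUF_SIZE]
    __attribute__((section(".bulk_sram"), aligned(32)));

/* At start-up, the application would hand tensor_arena and ml_model_data to
 * the inference runtime and point the sensor driver at frame_buffer. */
```

The design intent is that scratch accesses stay in TCM, close to the CPU and NPU, while bulk sensor data and the model travel over the shared high-speed bus, so the two kinds of traffic do not compete.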
When large on-chip memories are distributed in this way to minimize competing bus traffic, memory transactions can proceed concurrently, bottlenecks are avoided, access times are minimized, and power consumption is low enough to suit a small battery.

Fig. 2: The Ensemble MCUs’ internal memory topology
The speed at which data is transferred between peripherals such as sensor interfaces, the memories, and the two types of processing resources in an Ensemble or Balletto MCU is a key reason for these devices’ superior performance in battery-powered edge AI applications.