The rights and wrongs of scaling generative AI software for the endpoint

Generative AI was born in the cloud.

The applications for generative AI that have captured the popular imagination, such as the ChatGPT and Gemini AI assistants or the Adobe Firefly image creator, rely on gargantuan models trained on data drawn from the entire internet. Only cloud data centers have sufficient compute resources to run these applications and provide inferencing results with acceptable latency.

When we think about deploying generative AI at the endpoint, the natural instinct is to adapt cloud-based systems to fit the much smaller compute resources available in an embedded device. In fact, an approach based on adaptation is in many cases the wrong one: endpoint AI needs endpoint-native generative AI technology.

In an earlier blog, we described how a microcontroller’s hardware architecture needs to be optimized if it is to implement generative AI at the endpoint. Endpoint devices also need generative AI software which is optimized for the endpoint.

This means rejecting the obvious and superficially easier approach: brute-force scaling, that is, aggressive quantization, of the models behind cloud-based generative AI, such as the Llama or Gemini large language models. Brute-force scaling can reduce a model's memory footprint to the megabyte range required by embedded devices.
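To see what this brute-force route looks like in practice, here is a minimal sketch using PyTorch's post-training dynamic quantization. The toy model is a hypothetical stand-in, not any real LLM, and the figures in the comments are approximate.

```python
import io
import torch
import torch.nn as nn

# A toy stand-in for a language model's dense layers; a real LLM is far
# larger, but the mechanics of post-training quantization are the same.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Post-training dynamic quantization: linear-layer weights are stored as
# int8 and dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m: nn.Module) -> int:
    # Measure the on-disk size of the model's weights.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print(f"fp32: {serialized_size(model) / 1e6:.1f} MB")      # roughly 33.6 MB
print(f"int8: {serialized_size(quantized) / 1e6:.1f} MB")  # roughly 8.5 MB
```

The weights shrink roughly fourfold, but the transformation does nothing to preserve the quality of the model's outputs.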

But severe quantization has a damaging effect on an AI model's accuracy and on the quality of its inferencing. The root of the problem is that LLMs are universal ‘models of everything’: they are intended to be able to respond to any prompt about any domain or phenomenon known to humans, and it is this universality that makes them so large in the first place.

At the endpoint, effective generative AI applications do not need to be universal. It is in the nature of embedded devices that their operation is context- and environment-specific: a home automation hub, for instance, does not need to ‘understand’ all the physical phenomena that humans have studied, only a limited set, such as temperature, sound and air quality, in the ranges found in homes.

So the model used to implement generative AI in this type of endpoint device does not need to be a universal model of everything; it can instead be a specially curated model trained only on relevant datasets. Similarly, for text- or speech-based applications, endpoint devices should use a specially trained small language model (SLM) rather than a slimmed-down version of an LLM.
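As a rough, back-of-the-envelope illustration of why an SLM fits where an LLM cannot, the sketch below counts the weights of a hypothetical GPT-style decoder. All hyperparameters are invented for illustration, and biases and normalization parameters are ignored.

```python
def gpt_param_count(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Approximate weight count for a GPT-style decoder (sketch only)."""
    # Token embeddings, assumed weight-tied with the output head.
    embed = vocab * d_model
    # Per layer: four attention projections plus two feed-forward matrices.
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
    return embed + n_layers * per_layer

# An illustrative endpoint-scale SLM: about 8.4 million parameters,
# i.e. roughly 8 MB at int8 precision; megabytes, not gigabytes.
print(gpt_param_count(vocab=8192, d_model=256, n_layers=8, d_ff=1024))
```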

In optimizing the model for the endpoint, developers can also take the opportunity to eliminate operations, such as transpose operators, that map poorly onto an MCU’s neural processing unit (NPU), the workhorse of AI compute at the endpoint, as the sketch below illustrates.
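A minimal sketch of the kind of rewrite involved, assuming a PyTorch model headed for an NPU toolchain: the naive projection below performs a transpose on every forward pass, which shows up as a discrete operator in the exported graph, while storing the weight pre-transposed removes the operator entirely.

```python
import torch
import torch.nn as nn

class NaiveProjection(nn.Module):
    """Transposes its weight on every forward pass (hypothetical example)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The .T here becomes a TRANSPOSE operator in the exported graph,
        # which an endpoint NPU may not accelerate.
        return x @ self.weight.T

class NPUFriendlyProjection(nn.Module):
    """Stores the weight pre-transposed; no transpose is needed at runtime."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_in, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same arithmetic, given a correspondingly laid-out weight, but the
        # exported graph now contains only the matmul the NPU accelerates.
        return x @ self.weight
```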

Even the best endpoint AI MCUs, such as the Ensemble and Balletto products supplied by Alif Semiconductor, which offer an attractive combination of compute performance, bandwidth and low power consumption, impose hardware constraints. Optimization rather than adaptation is what will enable the device manufacturer to create generative AI software that fits within these constraints.
