If you’ve not been on a desert island for the past 12 months, you cannot have missed the blizzard of announcements from chip manufacturers claiming to enable a thing called ‘edge AI’ – and increasingly, AI at the endpoint, where power and compute resources are even more constrained than in edge devices such as gateways.
Performing AI inferencing at the edge or endpoint rather than in the cloud offers multiple benefits, including lower latency, lower power consumption thanks to fewer wireless data transmissions, and lower cloud computing costs.
This is all good. And Alif Semiconductor products such as the Ensemble and wireless Balletto microcontrollers are already running AI models at astonishingly low power and low latency in customer designs for endpoint devices. Now OEMs are excited about the scope to implement not just conventional AI models such as CNNs and RNNs at the endpoint, but generative AI as well.
In the cloud, generative AI based on large language models (LLMs) can answer more or less any question about anything with at least a reasonable degree of accuracy. But running any kind of LLM requires huge data storage and access to large amounts of fast DRAM – out of the question for endpoint devices, for cost and power reasons.
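To put rough numbers on it: even a comparatively small LLM with seven billion parameters stored at 8-bit precision needs around 7 GB of memory just to hold its weights – hundreds of times more than the on-chip memory of even a generously specified microcontroller.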
So, as we have explained, endpoint implementations of generative AI will need to use specially adapted models such as small language models (SLMs), which do not offer the ‘answer everything’ capability of an LLM.
At first glance, this might look like a simple trade-off: the low power, low latency and low cost of local inferencing at the endpoint traded for the universal knowledge of a cloud-based model. So you might think that typical endpoint generative AI implementations will be just like generative AI in the cloud, but with limitations. For instance, smart glasses that perform real-time text translation might use an SLM that translates only Mandarin Chinese into English rather than any language, covers only a subset of the 1,000 most used Chinese words, and handles text but not speech.
This is one approach, but it is not the only one.
In fact, system architects are finding that a blended approach could enable OEMs to get the best of both worlds: cloud and local inferencing working together. In this blended model, the endpoint acts as a kind of ‘first responder’ for the AI application as a whole.
For instance, smart glasses might offer the capability to translate any foreign-language text in the field of view into the wearer’s native language. This means that the glasses need to know:
- That the wearer is looking at text
- That the text is in a foreign language
- How to translate the foreign words into the wearer’s native language
If the smart glasses have to perform all the inferencing required for these three functions in the cloud, they will be transmitting vast quantities of image data to the cloud all the time. This will consume huge amounts of energy to power the wireless transmissions, and will likely also incur substantial financial cost, both for the cloud computing and data storage and for metered data carried over a mobile network. And in the end, most of the data sent to the cloud will be discarded, because it will turn out not to be images of foreign-language text.
Much better, then, to process the scene in the wearer’s field of view locally, and determine there whether it contains foreign-language text. With a system for detecting the direction of the wearer’s gaze, the smart glasses can determine when the wearer is looking at text, such as a menu in a restaurant. Vision AI software in the glasses can then highlight the part of the menu that the wearer is looking at. Generative AI in the glasses can work out whether the text is in a foreign language.
Only then will the glasses send an image to the cloud for translation – and the image will contain only the snippet of the field of view that includes the text to be translated. The transmission typically goes via the user’s smartphone acting as an internet access point.
So the endpoint AI system works as a filter, discarding all the sensor data about the user’s environment that is not relevant to the user’s requirements, and uploading to the cloud only the data needed for inferences that the endpoint cannot perform itself. The result: a smooth, responsive and unobtrusive wearable AI system which operates on a small battery, and incurs minimal data and cloud computing costs.
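As a rough illustration of this filtering logic, the sketch below shows how the decision chain might be structured in firmware. All of the types and function names (capture_frame, gaze_on_text, is_foreign_language, upload_snippet and so on) are hypothetical placeholders standing in for the real camera driver, gaze tracker, on-device vision and language models, and wireless stack – they are not part of any Alif SDK.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical types standing in for real camera and vision data. */
typedef struct { const uint8_t *pixels; size_t len; } frame_t;
typedef struct { int x, y, w, h; } region_t;

/* Stub: capture one camera frame (placeholder data only). */
static frame_t capture_frame(void)
{
    static const uint8_t px[1] = {0};
    frame_t f = { px, sizeof px };
    return f;
}

/* Stub: gaze tracker reports whether the wearer is fixating on text,
 * and if so, which region of the frame contains it.                  */
static bool gaze_on_text(const frame_t *f, region_t *roi)
{
    (void)f;
    roi->x = 0; roi->y = 0; roi->w = 64; roi->h = 16;
    return true;
}

/* Stub: local model decides whether the text in the region is foreign. */
static bool is_foreign_language(const frame_t *f, const region_t *roi)
{
    (void)f; (void)roi;
    return true;
}

/* Stub: send only the cropped snippet to the cloud, via the phone link. */
static void upload_snippet(const frame_t *f, const region_t *roi)
{
    (void)f;
    printf("uploading %dx%d snippet for translation\n", roi->w, roi->h);
}

/* The endpoint acts as a filter: most frames are handled and discarded
 * locally, and only a small text snippet ever leaves the device.        */
static void process_one_frame(void)
{
    frame_t frame = capture_frame();
    region_t roi;

    if (!gaze_on_text(&frame, &roi))
        return;   /* wearer is not looking at text: discard locally */

    if (!is_foreign_language(&frame, &roi))
        return;   /* text is already in the native language: discard */

    upload_snippet(&frame, &roi);   /* cloud does only the final translation */
}

int main(void)
{
    process_one_frame();
    return 0;
}
```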
And because these generative and conventional AI filtering functions are implemented locally in a microcontroller – the Ensemble or Balletto products are ideal – the power consumption, size, weight and cost of the circuit are appropriate for endpoint designs.
Market sentiment suggests that this could be a common approach to the implementation of generative AI at the endpoint. Alif expects to be at the center of this new development as its second-generation Ensemble MCUs, which support transformer networks and so enable local generative AI inferencing at the endpoint, become available in 2025.