Voice as a part of AI tech goes down from the cloud to the edge
Updated: Nov 30, 2022
Voice technologies have been around for over thirty years. However, their adoption by consumers, and especially by industry, has always been regrettably low. The reason was the poor reliability and accuracy of voice solutions (engines), which made them impractical. The main voice technology, automatic speech recognition (ASR), transcribes voice to text. It is a compute-intensive technology that requires significant processing power and memory to deliver good results.
The good news is that this technology has made great leaps in the past few years, thanks to revolutionary advances in artificial intelligence (AI) and machine learning (ML). AI advances, in turn, owe their success and quick proliferation to the incredible surge of computational power they received from cloud technology. With this power, the neural networks that lie at the base of speech technologies, such as deep neural networks (DNNs) and recurrent neural networks (RNNs), have become very effective and deliver much better results. With the power of the cloud, speech recognition accuracy surpassed 95%, which corresponds to a word error rate (WER) below 5%. Analysts say that this level of accuracy is a real game changer. It allows the development of reliable industrial and consumer products with a high level of market adoption. That’s why, today, we see and use voice applications integrated in many popular platforms such as WhatsApp, Waze and more.
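To make the metric concrete: WER counts the word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The following is a minimal Python sketch of that standard calculation, not the scoring code of any particular ASR engine. The function name and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (assumed): one wrong word out of five.
print(word_error_rate("turn on the kitchen light",
                      "turn on the kitchen lights"))  # 0.2
```

An engine with 95% accuracy in this sense makes, on average, no more than one such error in every twenty reference words.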
The advent of cloud-empowered AI has opened yet another horizon for voice applications: speech and language processing. Now that we have speech accurately transcribed to text, the next step is to understand the semantics of the language and, therefore, the speaker’s intention. This is quite a task for artificial intelligence. It is called natural language processing (NLP) and natural language understanding (NLU). With these abilities, voice becomes smart voice. Implementing NLU relies heavily on machine learning. This is another compute-intensive part of smart voice that requires powerful processing and substantial memory resources, both of which are widely available in the cloud.
An additional component of smart voice is conversational capability. If a solution or product (a smart appliance or smart vehicle, for example) has an artificial “brain” that can understand the user’s speech, it is only natural for it to respond to the user in speech as well. This is where dialog management (DM) comes in as a part of conversational AI.
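The chain described so far, from transcribed text to intent to a reply, can be illustrated with a deliberately tiny Python sketch. Everything in it is a hypothetical stand-in: real NLU uses trained models rather than keyword sets, and the intents, phrases and class names below are assumptions made only for illustration. The input is assumed to be text already produced by ASR.

```python
from dataclasses import dataclass

# Hypothetical intents and trigger words, chosen only for illustration.
INTENT_KEYWORDS = {
    "light_on": {"light", "on"},
    "light_off": {"light", "off"},
}


def understand(transcript: str) -> str:
    """Toy NLU: return the first intent whose keywords all occur in the text."""
    words = set(transcript.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords <= words:
            return intent
    return "unknown"


@dataclass
class DialogManager:
    """Toy DM: remembers the device state and chooses the reply to speak."""
    light_is_on: bool = False

    def respond(self, intent: str) -> str:
        if intent == "light_on":
            if self.light_is_on:
                return "The light is already on."
            self.light_is_on = True
            return "Turning the light on."
        if intent == "light_off":
            if not self.light_is_on:
                return "The light is already off."
            self.light_is_on = False
            return "Turning the light off."
        return "Sorry, I did not understand. Could you rephrase?"


dm = DialogManager()
for utterance in ["Turn the light on", "turn the light on", "what time is it"]:
    print(dm.respond(understand(utterance)))
```

The point of the sketch is the division of labor: the NLU step maps words to an intention, while the dialog manager keeps context between turns, which is why the same request gets a different reply the second time.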
This would have been a “happily ever after”, had the cloud not turned out to have its disadvantages. Firstly, cloud computational power costs money. Secondly, cloud server farms have become real plants of the digital industry: they consume significant power and require expensive, energy-hungry cooling, which in turn has a negative environmental impact. Thirdly, there is a distinct communication latency in the transmission of data from local apps and devices to the cloud and back. In many cases, this latency is critical for real-time applications.
Another significant disadvantage of the cloud is related to the security, safety and privacy of data delivered to the cloud for processing. For voice and language applications especially, safety and privacy concerns become a real factor in the decision to use cloud services at all.
With the growing computational power of mobile and small devices all around us, we see the focus turning back to them. This is called edge computing. According to Wikipedia, edge computing is “a distributed computing paradigm which brings computation and data storage closer to the location where it is needed”.
The question then becomes how to effectively operate these smart, AI-infused devices. Introducing cognitive capabilities to everyday objects makes it possible to interact with them via speech, in a natural way that is similar to human interaction. This means that we could literally speak to our devices in free conversation, the most natural and effective user interface with which we can control the world around us.
With the IoT revolution, it became obvious that smart voice interfaces would be very effective in many places in the IoT world. Analysts say that voice in IoT will soon become the “voice for everything”.
This applies to many Industry 4.0 segments, such as manufacturing, transportation and logistics, and to consumer segments such as retail, smart home, healthcare and others. Smart voice will be particularly indispensable for wearable devices, for small smart devices that do not have any haptic interface, and in cases when the user has both hands busy, such as while driving or operating complex manufacturing machines or robots.
Today, there are just over 8 billion connected IoT devices. Imagine a world of 1 trillion connected devices! IoT is well on track to this imagined future, with most devices operating at the network edge, many running compute-intensive tasks. Smart voice will serve here very well.
Any smart device or machine with the capability to interact with users via smart voice must be able to process voice and language in an autonomous, embedded mode, and, in many cases, without a cloud connection.
This is the issue that technology experts are trying to resolve. One approach is to use advanced hardware technologies. The other is to innovate in, or optimize, neural network processing algorithms.
On the hardware side of things, a number of new, effective platforms with specialized architectures developed for AI processing and acceleration have been introduced. Leaders of this development include tech giants Google and Intel. Microsoft has recently joined this race as well, by offering Graphcore’s IPU (Intelligence Processing Unit) on its Azure cloud. However, these solutions still consume too much power to work in embedded mode (on edge tasks).
To meet edge processing requirements, GreenWaves Technologies introduced GAP8, an application processor designed to perform AI analysis of sound, images and vibration on battery-operated devices. Its engine accelerates inference (applying a trained network to new data) for convolutional neural networks (CNNs). GreenWaves offers an API and a rich compute library in its SDK. GAP8 offers dynamically adjustable power consumption points from 1 mW to 60 mW, as well as a standby mode.
Startup company Groq offers an innovative architecture in its Tensor Streaming Processor (TSP), which operates highly effectively while using minimal memory resources. Its performance is four times that of the highest-tier GPU (graphics processing unit) offered by Nvidia, a world leader in AI processing hardware platforms.
Recently, a new and fast-growing field of machine learning technologies and applications has appeared: “tinyML”, or tiny machine learning. This field includes hardware (e.g. dedicated integrated circuits), algorithms, and software capable of on-device analytics of sensor data (e.g. audio, video, biomedical) at minimal power consumption (typically in the mW range and below), which enables a variety of always-on use cases targeting battery-operated devices. TinyML hardware is quickly becoming “good enough” for many industrial applications. This reflects the significant progress made in algorithms, networks, and compact neural network models, which now work in 100 kB of memory and less.
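A quick back-of-the-envelope calculation shows what a 100 kB budget means for a neural network. The Python sketch below counts the parameters of a small fully connected network and compares its size with 32-bit floating-point weights against 8-bit integer weights. The layer sizes are hypothetical, and 8-bit quantization is used here as one common way to shrink a model, which is an assumption rather than something stated above.

```python
def parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def footprint_kb(layer_sizes, bytes_per_parameter):
    """Storage needed for the parameters, in kB (1 kB = 1000 bytes)."""
    return parameter_count(layer_sizes) * bytes_per_parameter / 1000


# Hypothetical network: 490 input features, two hidden layers, 12 outputs.
LAYERS = [490, 128, 64, 12]

print(parameter_count(LAYERS))             # 71884 parameters
print(f"{footprint_kb(LAYERS, 4):.1f} kB")  # 287.5 kB with 32-bit floats
print(f"{footprint_kb(LAYERS, 1):.1f} kB")  # 71.9 kB with 8-bit integers
```

Under these assumptions, the same network misses the 100 kB budget with 32-bit weights but fits comfortably once each weight is stored in a single byte. Activations and code need memory too, so a real budget is tighter than this parameter-only estimate.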
To meet on-device AI requirements, Google revealed Coral, its little-known initiative. Coral is a platform of hardware and software components that help to build devices with local AI, providing hardware acceleration for neural networks right on edge devices.
The heart of the Coral hardware module is the Google Edge TPU (tensor processing unit), an ASIC chip optimized to run lightweight machine learning algorithms in IoT devices.
Market analysts say that more than 750 million edge AI chips and computers will be sold in 2020, rising to 1.5 billion by 2024 (The Verge, Jan 14, 2020).
Despite the high AI performance of new tiny TPU and TSP processors optimized for the edge, their industry proliferation will still take time. The majority of embedded processors today have a standard general-purpose CPU or RISC architecture. The most popular among them is the RISC-based ARM family. The architectural simplicity of ARM processors allows very small implementations with low power consumption. This makes them good enough to power edge computing tasks in the embedded and IoT world. The advanced ARM family of 64-bit Cortex processors demonstrates excellent performance in embedded applications. The ARM Cortex-A72, which powers the Raspberry Pi 4, makes that board a prominent platform for the compute-intensive tasks needed to build smart speakers and voice assistants that work without a connection to the cloud.
A year ago, ARM, the world leader in embedded computing, introduced Neoverse, a cloud-to-edge infrastructure foundation designed for a future of 1 trillion intelligent devices, which makes it possible to build products that span from cloud to edge. This will be a hyper-distributed infrastructure.
Edge computing augments cloud and on-premises computing to enable new customer experiences. However, it is not likely to replace the cloud; rather, the cloud will be used, when applicable, to support a new smart IoT ecosystem on the edge.
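One possible way to picture this division of labor is an edge-first policy: the device recognizes speech locally and turns to the cloud only when its own result is weak and a connection is available. The Python sketch below is a hypothetical illustration of that idea, not a description of any vendor’s implementation. The recognizer functions, the confidence threshold and the sample phrases are all assumptions.

```python
from typing import Callable, Optional, Tuple

# A recognizer takes raw audio and returns (text, confidence from 0 to 1).
Recognizer = Callable[[bytes], Tuple[str, float]]


def recognize_hybrid(audio: bytes,
                     local: Recognizer,
                     cloud: Optional[Recognizer] = None,
                     min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (text, source), where source is "edge" or "cloud"."""
    text, confidence = local(audio)
    # Stay on the device when the local result is good enough
    # or when no cloud service is configured at all.
    if confidence >= min_confidence or cloud is None:
        return text, "edge"
    try:
        cloud_text, _ = cloud(audio)
    except ConnectionError:
        # No network: the local result is still better than nothing.
        return text, "edge"
    return cloud_text, "cloud"


# Stand-in recognizers (assumed) that return fixed results.
def confident_local(audio: bytes) -> Tuple[str, float]:
    return "stop the conveyor", 0.93


def unsure_local(audio: bytes) -> Tuple[str, float]:
    return "stop the convey", 0.41


def working_cloud(audio: bytes) -> Tuple[str, float]:
    return "stop the conveyor", 0.98


def offline_cloud(audio: bytes) -> Tuple[str, float]:
    raise ConnectionError("no network")


print(recognize_hybrid(b"", confident_local, working_cloud))
print(recognize_hybrid(b"", unsure_local, working_cloud))
print(recognize_hybrid(b"", unsure_local, offline_cloud))
```

With such a policy the device keeps working without a connection, and audio leaves it only in the minority of cases where local recognition is not confident enough.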
A diversity of architectures spanning silicon, systems and software on edge platforms will soon be the norm. Whether on TPU or RISC architectures running AI-optimized software, this promises conversational AI implementations on the incoming wave of embedded smart devices and industrial IoT.
Onvego operates in this cloud-to-edge computational domain. It provides its conversational AI solutions using a hybrid approach. Running its solutions on small embedded platforms such as Raspberry Pi, the company delivers smart voice products to customers in the manufacturing, logistics and transportation sectors who require intelligent voice services and demand edge and cloudless implementations.