More than one in four people currently integrate speech recognition into their daily lives. A new algorithm developed by a University of Copenhagen researcher and his international colleagues makes it possible to interact with digital assistants like “Siri” without any internet connection. The innovation allows for speech recognition to be used anywhere, even in situations where security is paramount.
Talking to a computer was once the stuff of science fiction. Nowadays, saying “Hey Siri,” or addressing Alexa, Google or another digital assistant on a smartphone or other interactive gizmo, has become commonplace. Yet, in the future, the role of speech recognition may become even more important.
Studies suggest that one in four people already use these technologies on a regular basis, and if predictions hold true, by 2025 the number of devices equipped with speech recognition will exceed the planet’s population. And the technology is still evolving.
Until now, speech recognition has relied upon a device being connected to the internet. This is because the algorithms typically used for this process require significant amounts of temporary random access memory (RAM), which is usually provided by powerful data center servers. Indeed, try switching your smartphone to airplane mode and see how far your voice commands get you. But change is in the air.
A new algorithm developed by Professor Panagiotis Karras from the University of Copenhagen’s Department of Computer Science, together with linguist Nassos Katsamanis of the Athena Research Center in Greece, and researchers from Aalto University in Finland and KTH in Sweden, allows even smaller devices like smartphones to decode speech without needing substantial memory—or internet access.
The code, recently presented at the Interspeech 2024 conference, employs a clever strategy: it “forgets” what it doesn’t need in real-time.
“Speech recognition fundamentally works by matching the small speech sounds we use to form words and sentences—known as phonemes—with a library of corresponding sounds,” explains Panagiotis Karras. “Probabilities are calculated for matches and the subsequent combinations that go on to form our words and sentences. The most likely sequences are calculated and the software translates these sounds into text.”
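The article does not name the decoding method Karras describes, but picking the most likely sequence from per-sound probabilities is what the classic Viterbi algorithm does, so that is assumed here. The following is a minimal sketch on a made-up two-state model: the states, observations and probabilities are purely illustrative and have nothing to do with real phoneme libraries.

```python
import math

# Hypothetical toy model: two hidden "sound classes" and two observed symbols.
STATES = ["a", "b"]
LOG_START = {s: math.log(p) for s, p in {"a": 0.6, "b": 0.4}.items()}
LOG_TRANS = {
    s: {t: math.log(p) for t, p in row.items()}
    for s, row in {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}.items()
}
LOG_EMIT = {
    s: {o: math.log(p) for o, p in row.items()}
    for s, row in {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}.items()
}
OBS = ["x", "x", "y", "y"]


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (most likely state sequence, number of stored backpointer rows)."""
    # score[s] is the best log-probability of any path that ends in state s.
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []  # one row per time step, so it grows with the input
    for o in obs[1:]:
        new_score, row = {}, {}
        for s in states:
            # Best predecessor of s at this time step.
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][o]
            row[s] = prev
        score = new_score
        backpointers.append(row)
    # Trace back from the best final state to recover the whole sequence.
    path = [max(states, key=lambda s: score[s])]
    for row in reversed(backpointers):
        path.append(row[path[-1]])
    path.reverse()
    return path, len(backpointers)


path, stored_rows = viterbi(OBS, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path, stored_rows)  # ['a', 'a', 'b', 'b'] 3
```

Note that the backpointer table gains one row for every additional observation, which is the kind of memory growth the article goes on to discuss.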
Current algorithms require increased memory the longer one speaks, as all alternative combinations must remain open until the final sound is analyzed. The new algorithm does away with this problem.
“The algorithm conceived by Panos and developed further by our team does something entirely new,” says co-developer and co-author Katsamanis. “Unlike the existing gold standard algorithm used since speech recognition’s early days, our algorithm only stores a fraction of the processing data, serving as a set of ‘coordinates.’ With these, an entire sequence can be reconstructed, which makes speech recognition possible with significantly less RAM.”
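The article does not disclose how the researchers' algorithm works, so the following is not their method. It is only a rough sketch of the general trade the quote hints at, using a well-known checkpointing variant of Viterbi decoding (itself an assumption about the "gold standard" being replaced): keep the scores at every `k`-th step, discard the rest, and recompute one block at a time when tracing back. The toy model, the helper names and the parameter `k` are all hypothetical.

```python
import math

# Hypothetical toy model (illustrative numbers only).
STATES = ["a", "b"]
LOG_START = {s: math.log(p) for s, p in {"a": 0.6, "b": 0.4}.items()}
LOG_TRANS = {
    s: {t: math.log(p) for t, p in row.items()}
    for s, row in {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}.items()
}
LOG_EMIT = {
    s: {o: math.log(p) for o, p in row.items()}
    for s, row in {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}.items()
}
OBS = ["x", "x", "y", "y"]


def advance(score, o, states, log_trans, log_emit):
    """One decoding step: return the new scores and this step's backpointers."""
    new_score, row = {}, {}
    for s in states:
        prev = max(states, key=lambda p: score[p] + log_trans[p][s])
        new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][o]
        row[s] = prev
    return new_score, row


def viterbi_checkpointed(obs, states, log_start, log_trans, log_emit, k):
    """Most likely state sequence, storing scores only at every k-th step."""
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    checkpoints = {0: score}
    # Forward pass: backpointers are thrown away immediately.
    for t in range(1, len(obs)):
        score, _ = advance(score, obs[t], states, log_trans, log_emit)
        if t % k == 0:
            checkpoints[t] = score
    state = max(states, key=lambda s: score[s])
    path = [state]
    # Backward pass: rebuild backpointers for one block at a time.
    end = len(obs) - 1  # latest time step whose state is already known
    while end > 0:
        start = ((end - 1) // k) * k  # nearest checkpoint strictly before end
        block_score, rows = checkpoints[start], []
        for t in range(start + 1, end + 1):
            block_score, row = advance(block_score, obs[t], states, log_trans, log_emit)
            rows.append(row)
        for row in reversed(rows):
            state = row[state]
            path.append(state)
        end = start
    path.reverse()
    return path


print(viterbi_checkpointed(OBS, STATES, LOG_START, LOG_TRANS, LOG_EMIT, k=2))
```

In this sketch, at most `k` backpointer rows exist at any moment, plus roughly one score vector per `k` steps, at the cost of running the forward computation about twice. That mirrors the article's point about spending slightly more time to save memory, without claiming to reproduce the patented approach.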
From keywords to entire sentences
This maneuver may sound simple, but it involves an entirely new and unique code for which the researchers have sought a patent. This algorithm reduces the need for critical memory without sacrificing recognition quality. And though it requires slightly more time and computational power, the researchers say that the difference is negligible vis-à-vis the muscular capabilities of modern devices.
Moreover, it works without an internet connection, thus enabling speech recognition anywhere, even in the depths of the Amazon jungle. The researchers hope that it could eventually enable real-time language translation as well.
Single words or very short sentences are generally manageable when current software needs to store alternative sequences and libraries of potential sound interpretations. However, as sentences become longer and potential word combinations more complex, the demand for RAM increases.
“Certain small devices can already recognize and act based upon a few words without internet connectivity. For example, a smart home system can recognize keywords such as ‘turn on’ or ‘turn off.’ This is known as small-vocabulary speech recognition. With our algorithm, it will be possible to recognize more extensive instructions or, in principle, entire languages—without an internet connection. This is referred to as large-vocabulary speech recognition,” says Professor Karras.
Enhanced inclusion, security, and energy savings
According to the researchers, the invention opens up a range of possibilities, from practical, security-related, and societal benefits to significant energy-saving potential.
For instance, many people could benefit from the ability to translate foreign languages while traveling, regardless of internet access. This is one possibility that the researchers hope to achieve. But the societal impact of linguistic accessibility, both now and in the future, could be far more significant.
Katsamanis sees great promise in the technology: “This algorithm can help democratize language technology by making information more accessible. To make translation tools and speech assistants available regardless of internet access will allow more people to engage in society. In particular, it will help people without written language skills or those with physical disabilities, by enabling them to understand and influence societal decisions.”
Another key advantage of this speech recognition invention is its security implications. When security is paramount, the new algorithm addresses a significant problem: internet connections can be hacked. By eliminating the need for internet access, the algorithm enhances security.
Furthermore, while the energy used by data centers to support current speech recognition technology may be invisible to consumers, it is highly relevant in a world facing climate change. The growing demand for this technology, when met by this invention, could lead to significant energy savings by reducing the enormous need for temporary memory.
“It is vital to reduce energy consumption to minimize reliance on fossil fuels, as many data centers still use these energy sources,” concludes Professor Karras.