A smartphone was taught to “read lips” to improve the accuracy of speech recognition

Short description

Researchers at the St. Petersburg Federal Research Center of the Russian Academy of Sciences have developed a smartphone application that recognizes human speech from lip movements, using artificial intelligence and computer-vision algorithms. The application improves the accuracy of voice assistants in noisy environments, such as crowded places or near heavy machinery. By combining information from the audio and visual signals, the program recognizes spoken commands more reliably. Potential uses span a range of settings, including heavy industrial machinery, aircraft, and interactive information kiosks in shopping centers. The project was supported by a grant from the Russian National Science Foundation.


Researchers at the St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS) have taught a smartphone to recognize human speech from lip movements using artificial intelligence and computer-vision algorithms. The development will help improve the accuracy of voice assistants in noisy environments, for example in crowded places or when operating heavy machinery.

Today, systems that recognize human speech (the audio signal) and automatically execute commands are being adopted in many fields, from cellular phones to combat helicopters. They have traditionally served people with limb injuries and operators of complex equipment whose hands are busy. More recently, to improve user convenience, such systems have been gaining popularity in business applications, gadgets, and voice-controlled “smart home” systems.

Although modern recognition systems have advanced significantly in the accuracy of speech interpretation, their effectiveness can drop dramatically under strong noise, such as loud equipment or crowded places.


“We have developed a smartphone application that recognizes spoken speech and also lip-reads the user’s words by analyzing the video signal from the gadget’s camera. The program combines and analyzes information from the two sources to improve recognition accuracy. Experiments have shown that such a hybrid system recognizes human commands far more effectively in difficult, noisy conditions,” says Denis Ivanko, a senior researcher at the Laboratory of Language and Multimodal Interfaces of the St. Petersburg Federal Research Center of the Russian Academy of Sciences.

According to him, the application works on the same principle as the human cognitive system: when talking in a noisy place, a person involuntarily starts watching the interlocutor’s lips, trying to recover from them the information that could not be heard. This effect has been confirmed experimentally: when people in noisy conditions were asked to recognize speech from audio alone or from video alone, the group that received both types of data performed best.

At the core of the program is a neural network model trained to recognize several hundred of the most common commands from audiovisual signals (video recordings accompanied by sound). Moreover, according to the scientists, the neural network can take in an audiovisual signal and automatically decide which data (video, sound, or both) will yield the highest recognition accuracy.
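The article does not disclose the model’s internals, but the decision logic it describes can be illustrated with a minimal score-level fusion sketch. Everything below is a hypothetical toy example, not the authors’ actual software: it assumes each modality produces a probability distribution over a small command vocabulary, and that a separate estimate of audio reliability (low in heavy noise) is available. The names `COMMANDS` and `fuse_and_decide` are illustrative.

```python
# Toy sketch of score-level audiovisual fusion (illustrative only).
# Two hypothetical per-modality classifiers each return a probability
# distribution over a fixed command vocabulary; the fusion step weights
# them by an estimated audio reliability before picking a command.

COMMANDS = ["stop", "start", "left", "right"]

def fuse_and_decide(audio_probs, video_probs, audio_reliability):
    """Combine per-modality scores; audio_reliability in [0, 1] estimates
    how trustworthy the audio channel is (low in heavy noise)."""
    w_a = audio_reliability
    w_v = 1.0 - audio_reliability
    fused = [w_a * a + w_v * v for a, v in zip(audio_probs, video_probs)]
    best = max(range(len(COMMANDS)), key=lambda i: fused[i])
    return COMMANDS[best], fused[best]

# In quiet conditions the decision follows the audio channel...
cmd, _ = fuse_and_decide([0.7, 0.1, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2],
                         audio_reliability=0.9)
print(cmd)  # stop

# ...while in heavy noise it leans on the video channel instead.
cmd, _ = fuse_and_decide([0.25, 0.25, 0.25, 0.25], [0.05, 0.8, 0.1, 0.05],
                         audio_reliability=0.1)
print(cmd)  # start
```

In a real system the reliability weight would itself be predicted by the network from the input signal rather than supplied by hand, but the sketch shows why feeding both streams lets the system keep working when one of them degrades.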

During the experiments, the application was used by drivers of noisy heavy trucks at a Russian logistics company; the software was installed on the test subjects’ smartphones. Recognition accuracy from the visual signal alone was 60–80%, while the combination with the audio signal exceeded 90%.

“Last year, at an international scientific competition, our model also took first place in the world in the accuracy of reading speech from a speaker’s lips. Participants trained their neural networks on an open English-language database of 500,000 video recordings and tested them on a set of 25,000 recordings. Our model achieved close to 90% recognition accuracy using only the movements of the speakers’ lips. We expect that in the future our application could be used by pilots of aircraft and operators of heavy industrial machinery, or in interactive information kiosks in shopping centers and other crowded public places,” explains Denis Ivanko.

The research was supported by a grant from the Russian National Science Foundation (No. 21-71-00132). In addition, a state registration certificate was obtained for the developed software. The project’s results were also published in the proceedings of the European Signal Processing Conference (EUSIPCO).

This software project is part of a broader effort by scientists at the St. Petersburg Federal Research Center of the Russian Academy of Sciences to create specialized automatic speech recognition systems. For example, the researchers previously developed an intelligent system that helps doctors communicate with deaf patients.
