Open the pod bay doors, please, HAL: Meta's AI simulates lip reading
It is widely known that people hear speech not only by listening with their ears but also by picking up cues from the mouth movements they see speakers make.
Similarly, combining visual observation with audio can help a computer better parse human speech. In a sense, computer programs can read lips, although that ability is laborious to engineer.
Recent work by Meta, the parent of Facebook, Instagram and WhatsApp, suggests a more efficient path to the day when computers can read lips as well as HAL 9000 did when Dr. David Bowman and Dr. Frank Poole tried to evade its audio sensors inside the pod in the movie "2001."
Last Friday, Meta artificial intelligence scientists published a research report in which they were able to significantly reduce the effort required to engineer software that parses words from the lip movements of speakers in recorded videos. The work was also able to use lip-reading technology to significantly improve speech recognition in noisy environments.
The program is "75% more accurate than the best audio-visual speech recognition systems (which use both the speaker's sound and images to understand what the speaker is saying)," the authors say.
Of course, there is a Metaverse angle here: Not only could the program be used for instant translation; at some point, it could also "help generate real-time lip movements in avatars, to deliver a real sense of presence - that feeling of being with someone even if they're on the other side of the world."
The work represents progress along two lines. One is self-supervised learning, which avoids explicit clues, such as text transcripts, and instead has the program discover structure in the data on its own. The other area of development is multimodal neural networks, which combine data of different types in such a way that they reinforce each other.
The result, known as AV-HuBERT (the "AV" stands for audio-visual, the "Hu" for "hidden unit"), combines audio and visual signals to detect words from lip movements.
Lead author Bowen Shi and colleagues Wei-Ning Hsu, Kushal Lakhotia, and Abdelrahman Mohamed of Facebook posted their paper, "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction," on the arXiv pre-print server last Friday. The authors also wrote a blog post that you may find easier to digest.
As Shi & Co. explain, previous work has also been multimodal, combining visual data (video frames) with audio data (waveform snippets) to train a neural network to predict how they match up.
But such programs have tended to rely on extra prepared clues, such as transcriptions of the speakers' videos into text sentences that then serve as labels. The new work takes the self-supervised route, assembling patterns on its own without outside structure.
"This is the first system to jointly model speech and lip movements from unlabeled data - raw video that has not already been transcribed," the authors write in their blog post.
Many previous models pre-train on lip-reading videos with word-level annotations, which are expensive to collect because they require word-boundary information. Unlike these models, our models are fully pre-trained from scratch using the proposed approach.
The AV-HuBERT program they created builds on an audio-only program called HuBERT that was introduced last year by Hsu and colleagues. As the name implies, HuBERT uses the bidirectional Transformer neural network approach developed at Google in 2018.
By "masking" parts of an audio recording, meaning leaving out sections of the audio waveform, the HuBERT neural network in its training phase had to reconstruct which pieces of audio go together.
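To make the masking idea concrete, here is a minimal sketch in Python. It is not the authors' code: the function name `mask_spans`, the span length, the number of spans and the toy frame values are all illustrative assumptions. It only shows what hiding contiguous stretches of a frame sequence looks like, so that a model would have to predict what belongs in the gaps.

```python
import random

def mask_spans(frames, num_spans, span_len, mask_token=None, seed=0):
    """Hide `num_spans` contiguous spans of `span_len` frames.

    Returns the masked copy and the sorted indices that were hidden.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    masked = list(frames)
    hidden = set()
    last_start = max(0, len(frames) - span_len)
    for _ in range(num_spans):
        start = rng.randint(0, last_start)
        for i in range(start, min(start + span_len, len(frames))):
            masked[i] = mask_token
            hidden.add(i)
    return masked, sorted(hidden)

# Toy stand-in for a sequence of audio frames (made-up values).
frames = [0.1, 0.4, 0.3, 0.9, 0.7, 0.2, 0.5, 0.8]
masked, hidden = mask_spans(frames, num_spans=2, span_len=2)
print(masked)  # hidden positions show up as None
print(hidden)  # indices the model would be asked to fill in
```

Spans may overlap in this sketch, so the number of hidden frames can be smaller than `num_spans * span_len`.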
Now, with AV-HuBERT, Shi and team combine bits of audio with frames from videos of people speaking. The training phase of the neural network proceeds essentially in two stages. First, as in the original audio-only HuBERT, they use the attention approach to mask the audio and then group those audio waveforms into clusters, which are groups of examples that are in some way close to one another in their attributes.
These clusters then become the target for the second stage of the neural network. The multimodal part of AV-HuBERT simultaneously masks both the images of speakers' lips and the audio waveform and then attempts to match them to the clusters established in the first stage. In this way, the program works out which lip configurations correspond to which audio waveforms, thereby "learning" the correlation between mouth movement and audio output.
That is, effectively, a self-supervised approach that divines structure without explicit clues.
The combination means that the attention placed on image frames and the attention placed on audio waveforms reinforce each other to produce better clusters than either would alone. Those clusters become the "target" of subsequent tasks, such as lip reading and speech recognition.
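The two-stage idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's implementation: k-means is used here as a stand-in clustering method, the two-dimensional feature vectors are invented, and the masked positions are picked by hand. The point is only that cluster IDs computed in the first stage become the labels a model must predict for the frames that were hidden.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Tiny k-means; returns one cluster ID per input point."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Stage 1: cluster made-up per-frame features into discrete units.
features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
cluster_ids = kmeans(features, k=2)

# Stage 2: for frames that were masked, the cluster ID is the prediction target.
masked_positions = [1, 2, 5]  # hypothetical masked frame indices
targets = {i: cluster_ids[i] for i in masked_positions}
print(cluster_ids)
print(targets)
```

A real system would feed the masked audio and lip frames to a neural network and train it to output these cluster IDs; that part is omitted here.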
As the authors explain,
AV-HuBERT simultaneously captures linguistic and phonological information for unmasked regions from both the lip-movement and audio streams into its latent representations, and then encodes their long-range temporal relationships to solve the masked-prediction task.
Once AV-HuBERT has been self-trained in this way, the authors fine-tune it by introducing actual labeled video, hours of it, with formal transcripts that tell the machine where the words are in the video.
The main data set used to test and train the AV-HuBERT program is LRS3, developed in 2018 by Triantafyllos Afouras and colleagues at Oxford, which is "the largest publicly available sentence-level lip reading dataset to date. It consists of over 400 hours of video, extracted from TED & TEDx talks in English from YouTube."
As a result of AV-HuBERT's self-supervised training, it can predict the words from videos of speakers better than all previous attempts, Shi and company write.
However, more important than the raw score is the significant reduction in the amount of data required to train the program.
"AV-HuBERT achieves state of the art using 433 hours of text transcriptions, two orders of magnitude less than the 31,000 hours of labeled data used in the prior best approach," they write.
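As a quick check on the "two orders of magnitude" phrasing, the two figures quoted above can be compared directly. The variable names below are illustrative; only the numbers 433 and 31,000 come from the article.

```python
import math

labeled_hours_avhubert = 433   # hours of transcribed video, per the quote
labeled_hours_prior = 31_000   # hours of labeled data in the prior best approach

ratio = labeled_hours_prior / labeled_hours_avhubert
orders_of_magnitude = math.log10(ratio)

print(f"{ratio:.1f}x less labeled data")
print(f"{orders_of_magnitude:.2f} orders of magnitude")
```

The ratio works out to roughly 72 times less labeled data, which is a little under two full orders of magnitude, so the quoted phrase is a rounded description.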
With far less data required, it becomes possible to perform lip-reading tasks in languages that have much less data than others, known as low-resource languages. (Think of languages other than English, French and German, for example.)
The authors observe that "As future work, AV-HuBERT can be applied for multilingual lip reading in low-resource languages," and that the same "approach can be extended to other applications of visual speech representation, such as speech enhancement and generation."
Shi and colleagues supplemented their findings with a second paper posted last week that describes the use of AV-HuBERT for automatic speech recognition. Here, the focus is on how to better parse speech in the presence of noise.
Speech recognition "deployed in meeting scenarios is subject to babble noise, while one used in a home environment naturally encounters music, cooking or vacuum machine noises." Their question is whether AV-HuBERT can overcome such ambient noise.
During training, Shi and team mix noise clips in with AV-HuBERT's video frame and audio waveform samples. The result, they write, is that the program gets good at working around the noise. So much so that AV-HuBERT achieves a 50% reduction in the word error rate, or WER, the proportion of mistaken words, compared with previous speech recognition systems.
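For readers unfamiliar with WER, a common way to compute it is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. The article only describes WER as the proportion of wrong words, so the sketch below uses that standard definition as an assumption about how such scores are typically calculated; the function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

reference = "open the pod bay doors please hal"
hypothesis = "open the pot bay doors hal"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")
```

In this toy case one word is substituted and one is dropped, so two of the seven reference words count as errors.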
"Our future work involves the use of audio-visual recognition in very low-resource and multilingual settings," they write.
So, how real is something like HAL 9000's lip reading? The notion that AI is now better than humans at lip reading has been written about in recent years in connection with earlier AI work. AV-HuBERT's best word error rate is, indeed, far better than that of professional human lip readers, at 26.9%. Apparently, the best that human lip readers manage is only 40% (they are wrong four times out of ten). Obviously, for things like transcribing talks after the fact, this could be a great boost for software programs.
In practice, however, there is a big caveat. This is really simulating lip reading. AV-HuBERT's results pass a test on canned video, not a live, free-form, in-the-wild conversation like Bowman and Poole's in the movie.
For now, you may still be safe inside the pod.