Greek scientist helps build machine that can lip-read better than humans
Scientists at Oxford University, led by a Greek researcher, have developed a machine that can lip-read better than humans.
Yannis Assael, Brendan Shillingford, Shimon Whiteson and Nando de Freitas used deep learning to create LipNet – software that reads lips faster and more accurately than was previously possible. Although LipNet has proven very promising, it is still at a relatively early stage of development: it has been trained and tested on a research dataset of short, formulaic videos showing a well-lit person face-on.
The artificial intelligence system watches video of a person speaking and matches text to the movement of their mouth with 93% accuracy, the researchers said. Automating lip-reading could help millions of people, they suggested.
But experts said the system needed to be tested in real-life situations.
Lip-reading is a notoriously tricky business: professional lip-readers can decipher what someone is saying only up to 60% of the time.
“Machine lip-readers have enormous potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification and silent-movie processing,” wrote the researchers.
But the team says its goal is to train the system on real-world examples, with Yannis Assael, one of the researchers on the project, writing that “performance will only improve with more data”.
They said that the AI system was provided with whole sentences so that it could teach itself which letter corresponded to which lip movement.
To train the AI, the team – from Oxford University’s AI lab – fed it nearly 29,000 videos, labelled with the correct text. Each video was three seconds long and followed a similar grammatical pattern.
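The sentence-level training described above matches the connectionist temporal classification (CTC) loss reported in the LipNet paper: the network outputs a per-frame probability distribution over characters plus a special “blank” symbol, and the loss sums the probability of every frame-by-frame alignment that collapses to the target sentence, so no per-frame letter labels are ever needed. A minimal pure-Python sketch of the CTC forward algorithm (the function name and toy probabilities are illustrative, not taken from the paper):

```python
# Minimal CTC forward algorithm: probability of a label sequence given
# per-frame symbol probabilities, summed over all alignments that
# collapse (drop blanks, merge repeats) to that sequence.
# Illustrative sketch only; real systems work in log-space for stability.

BLANK = 0  # index reserved for the CTC "blank" symbol

def ctc_forward(probs, labels):
    """probs[t][s] = P(symbol s at frame t); labels = target, no blanks."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for lab in labels:
        ext += [lab, BLANK]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]  # alpha[t][s]: prefix probability
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay on same symbol
            if s > 0:
                a += alpha[t - 1][s - 1]             # advance one step
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]             # skip over a blank
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid endings: the final label or the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# Two frames, two symbols (blank and "1"), each equally likely per frame.
# The paths (1,1), (blank,1) and (1,blank) all collapse to [1]:
# 3 * 0.25 = 0.75.
frame_probs = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_forward(frame_probs, [1]))   # 0.75
```

Training maximises this probability (in practice, its logarithm) over the labelled clips, which is how the system can “teach itself” the letter-to-lip-movement alignment without frame-level annotations.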
While human testers given similar videos had an error rate of 47.7%, the AI’s was just 6.6%. The fact that the AI learned from specialist training videos led some on Twitter to criticise the research. Writing on OpenReview, Neil Lawrence pointed out that the videos had “limited vocabulary and a single syntax grammar”.
“While it’s promising to perform well on this data, it’s not really groundbreaking. While the model may be able to read my lips better than a human, it can only do so when I say a meaningless list of words from a highly constrained vocabulary in a specific order,” he writes.
The project was partially funded by Google’s artificial intelligence firm DeepMind.