
Researchers Build An AI That's Better At Reading Lips Than Humans

An anonymous reader quotes the BBC: Scientists at Oxford say they've invented an artificial intelligence system that can lip-read better than humans. The system, which has been trained on thousands of hours of BBC News programs, has been developed in collaboration with Google's DeepMind AI division. "Watch, Attend and Spell", as the system has been called, can now watch silent speech and get about 50% of the words correct. That may not sound too impressive - but when the researchers supplied the same clips to professional lip-readers, they got only 12% of words right...
The system now recognizes 17,500 words, and one of the researchers says, "As it keeps watching TV, it will learn."
  • 17 years too late (Score:5, Insightful)

    by Anonymous Coward on Saturday March 18, 2017 @11:39AM (#54065483)

    I'm sorry Dave, I'm afraid I can't do that.

  • by v1 ( 525388 ) on Saturday March 18, 2017 @11:48AM (#54065541) Homepage Journal

    Seeing as there's so much closed-captioning going on, they've got an enormous volume of material to train their neural network on.

    I've done this sort of thing before, and often finding a large set of quality training material is a significant challenge.

    Getting half the words correct, then feeding that into a grammar / context engine should yield very close to 100% accuracy. That's what deaf (and hearing-impaired) lip readers have to do, since the stated 12% initial recognition rate is about right. They have to stay very focused on the speaker and make heavy use of context to work out what's being said. And that's a perfect job for a computer.
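The grammar / context engine idea can be sketched as a toy bigram rescorer. This is purely illustrative - the candidate words, lip-read scores, and bigram probabilities below are all made up, and the real system would use a far larger language model:

```python
# Toy "context engine": given noisy per-word candidates from a lip reader,
# pick the word sequence that a bigram language model scores highest.

def rescore(candidates, bigram):
    """Viterbi search over candidate words at each position.

    candidates: list of lists of (word, lipread_score) per position
    bigram: dict mapping (prev_word, word) -> probability
    """
    # best[word] = (best score of any path ending in word, that path)
    best = {w: (s, [w]) for w, s in candidates[0]}
    for slot in candidates[1:]:
        nxt = {}
        for w, s in slot:
            score, path = max(
                (ps * s * bigram.get((p, w), 1e-6), pp + [w])
                for p, (ps, pp) in best.items()
            )
            nxt[w] = (score, path)
        best = nxt
    return max(best.values())[1]

# Made-up example: the lip reader can't tell "prime"/"crime" apart well,
# but the language model strongly prefers "prime minister".
candidates = [
    [("prime", 0.6), ("crime", 0.4)],
    [("minister", 0.5), ("sinister", 0.5)],
]
bigram = {("prime", "minister"): 0.9, ("crime", "sinister"): 0.05}
print(rescore(candidates, bigram))  # -> ['prime', 'minister']
```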

    • The closed-captioning does speech-to-text, not lip reading. It's advanced to the point that you can dictate your SMS messages more reliably than fumbling around with an on-screen keyboard and auto-uncorrect.
      • The closed-captioning does speech-to-text, not lip reading.

        Sure, but if it did both, the error rate would go way down.

        • Sort of? Consider how many times dialogue is spoken off-camera, such as a voice-over or cutaway reaction shot, or when the speaker is simply not facing the camera. In those cases your reliability is cut in half anyway, since you lose the advantage of being able to lip read.
          • Also consider how frequently the captions differ from the actual spoken words.

            • This can happen for a number of reasons, actually. Sometimes it's an actual mistake, but also possible is a rephrasing of the line to make it easier to caption or easier to understand. Since captioning is most often geared towards Deaf people, and many grew up with English as a second language, some idioms and turns of phrase can seem out of context and aren't as appropriate for captions. There are some who bristle at this attempt at hand-holding and think captions should be 100% accurate to dialogue, while
              • Yes, I understand. But the fact that the captions and the spoken words often differ limits the effectiveness of combining captions and lip reading to reduce the error in machine translations. It doesn't matter much why the captions and the spoken words differ.

      • What are you talking about? Closed captioning for most media is manually entered and synced with the time. Speech-to-text captions (like those on YouTube) have far less accuracy, although sometimes they put real-time captioning (think televised news) to shame. But most of what you see on TV and everything on DVDs is written and checked by a human, and is not entirely reliant on STT transcription.
        • Closed captioning for live events (such as news) is speech-to-text. Easily detectable if you read the captions and listen to the words - the mistakes aren't typos, but similar-sounding words. Manual entry also introduces a few seconds of delay, just as simultaneous translation is not really simultaneous - there's a second or so of delay (though the translator can often anticipate what's about to be said from context - and when they goof, you get to hear it when they correct themselves).
      • by v1 ( 525388 )

        The closed-captioning does speech-to-text, not lip reading.

        Closed Captioning is the transmission of text of what is being said along with the video and audio stream. It's up to the receiver to do text to speech.

        The benefit of CC here is that you have the "problem" (the video of the speaker) AND the "answer" (the text that they spoke) to work with, and this is precisely what you require to train a neural network. A large volume of problems and correct solutions. "When you get THIS input, you are suppose
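The problem-and-answer pairing the comment describes can be sketched roughly like this. The names and data structures are illustrative assumptions, not the researchers' actual pipeline - real training would use aligned mouth-region video, not frame IDs:

```python
# Each captioned clip supplies the "problem" (video frames) and the
# "answer" (the caption text), i.e. a supervised training pair.
from dataclasses import dataclass

@dataclass
class TrainingPair:
    frames: list  # the "problem": mouth-region video frames for one utterance
    text: str     # the "answer": the caption spoken over those frames

def make_pairs(captioned_clips):
    """Turn (frames, caption) clips into supervised training pairs."""
    return [TrainingPair(frames, caption) for frames, caption in captioned_clips]

# Toy stand-in data: frame IDs instead of real video frames.
clips = [(["f0", "f1", "f2"], "good evening"),
         (["f3", "f4"], "prime minister")]
pairs = make_pairs(clips)
print(pairs[1].text)  # -> prime minister
```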

    • Would each closed-captioned syllable or word need to be manually synchronized with the video first? Or can the training be done without it?

      Getting half the words correct, then feeding that into a grammar / context engine should yield very close to 100% accuracy.

      But this AI is already using context to some degree. The article gives the example of "Prime Minister" for instance, where the AI knows that if the word "Prime" is read on their lips, that the word "Minister" will probably follow. Also, the AI has been trained in one context alone, which means that the context is already taken into account. For instance, if the same anch
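The "Prime" -> "Minister" effect described above is essentially bigram statistics learned from transcripts. A minimal sketch of that idea, using made-up toy transcripts and having nothing to do with the actual system:

```python
# Count word bigrams in transcripts, then predict the most likely next word.
from collections import Counter, defaultdict

def train_bigrams(transcripts):
    """Build a prev_word -> Counter(next_word) table from transcript lines."""
    counts = defaultdict(Counter)
    for line in transcripts:
        words = line.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

transcripts = [
    "the prime minister spoke today",
    "the prime minister answered questions",
    "a prime example of leadership",
]
model = train_bigrams(transcripts)
print(predict_next(model, "prime"))  # -> minister
```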

  • by Anonymous Coward

    Sees the computer AI progressing in its research, and decides to replace the movies being watched with the complete collection of 1960s-era Gojira monster films - dubbed in English with hardly any lip-syncing at all - followed by Chinese martial arts movies full of lines like "Yaaaaa!", "Huh?" and "Prepare to die!"

    The icing on the cake is when he throws in an Inspector Clouseau film.

  • by JustAnotherOldGuy ( 4145623 ) on Saturday March 18, 2017 @11:59AM (#54065591)

    The surveillance state is coming in its pants thinking about all the additional conversations they'll be able to monitor now.

    Time to break out the bandannas and cough-masks....soon it'll be fashionable to wear them in public!

  • That cry of dismay was the sound of thousands of blind gynecologists realizing they will be out of a job reading lips. :-)

    Of course the reality is grim - even more surveillance by marketers and the state - especially with TVs and webcams and (if you believe Trump) microwaves watching everything you say and do.

  • Go compare this to a deaf person that reads lips. I know of literally thousands that never miss a single spoken word as long as they're looking at the speaker's mouth.

    Source: Camfrog, where there are fucktons of deaf people communicating with those with hearing. We speak after getting their attention with a hand signal, they read our lips and reply with zero issues.

    • This is true. I once had a conversation with someone and was very surprised to later learn that the person was completely deaf. I had no clue.

  • Or was Frank Poole killed because HAL thought they were going to unplug the "Mammary Circus" and that was basically the only DVD the three of them could agree on watching?

  • I'm wondering what text they are using to train the AI about what was said. I sure hope it isn't the closed captioning text on the news broadcasts. In my experience that is only about 50% accurate itself.
  • by yodleboy ( 982200 ) on Saturday March 18, 2017 @12:40PM (#54065791)
    Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to the ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...
    • Because Berkeley lied when they said that they had to provide transcripts or remove the material. Section 107 of the Copyright Act of 1976 allows for fair use of teaching materials, and this allows 3rd parties to make all such materials available in more accessible forms, and for Berkeley to use the results of such work.

      They weren't interested in doing this. It's about monetization and artificial scarcity, pure and simple. This was just a smokescreen to remove the material.

      The blind will be using TTS screen

    • by Barnoid ( 263111 )

      Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to the ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...

      Good idea, but unfortunately it won't work in this case. Many of UC Berkeley's lecture videos only show the slides while you hear the lecturer talk.

  • "As it keeps watching TV, it will learn."

    When TV was first being introduced as a consumer product, one of the selling points of the idea was that people would be able to learn by watching it. If this works out as well as that, then the system will only be able to recognize when someone is uttering lines from commercials.

  • At least I know it won't be able to read my lips. You see, I speak American, not English.

  • Humans are very difficult to read.

  • Did he just say "No new taxes," or did he say "No Newt[Gingrich] Axes" ?

    Heck you were even told, prior to that line, "read my lips," so you got no excuses.


  • Maybe we'll finally find out what John Ford told Maureen O'Hara to say to John Wayne...a secret all three took to their graves...

Executive ability is deciding quickly and getting somebody else to do the work. -- John G. Pollard