Researchers Build An AI That's Better At Reading Lips Than Humans (bbc.com) 62

Posted by EditorDavid on Saturday March 18, 2017 @11:34AM from the just-like-HAL dept.

An anonymous reader quotes the BBC: Scientists at Oxford say they've invented an artificial intelligence system that can lip-read better than humans. The system, which has been trained on thousands of hours of BBC News programs, has been developed in collaboration with Google's DeepMind AI division. "Watch, Attend and Spell", as the system has been called, can now watch silent speech and get about 50% of the words correct. That may not sound too impressive - but when the researchers supplied the same clips to professional lip-readers, they got only 12% of words right...
The system now recognizes 17,500 words, and one of the researchers says, "As it keeps watching TV, it will learn."

This discussion has been archived. No new comments can be posted.

- Re: (Score:3)
  
  by toonces33 ( 841696 ) writes:
  
  Well, there is "Bad Lip Reading" - their videos are usually pretty funny.
- Re: (Score:3)
  
  by JustAnotherOldGuy ( 4145623 ) writes:
  
  "the same clips to professional lip-readers"
  ok, who else didn't know that there are "professional" lip readers?
  The police use them from time to time (on surveillance videos). I imagine there are other uses as well.
17 years too late (Score:5, Insightful)

by Anonymous Coward writes: on Saturday March 18, 2017 @11:39AM (#54065483)

I'm sorry Dave, I'm afraid I can't do that.

- Re: (Score:3)
  
  by hcs_$reboot ( 1536101 ) writes:
  
  16 actually (a comment to remove the Funny mod, that should be Insightful instead)
Great way to get flushed down the airlock! [n/t] (Score:2)

by mobby_6kl ( 668092 ) writes:

N/T
- Re: Maybe /. needs an AI ... (Score:1)
  
  by Anonymous Coward writes:
  
  with all the AI job obsolescence going on the universal income one is pretty much relevant
perfect opportunity (Score:3)

by v1 ( 525388 ) writes: on Saturday March 18, 2017 @11:48AM (#54065541) Homepage Journal

Seeing as there's so much closed-captioning going on, they've got an enormous volume of material to train their neural network on.
I've done this sort of thing before, and often finding a large set of quality training material is a significant challenge.
Getting half the words correct, then feeding that into a grammar / context engine should yield very close to 100% accuracy. That's what deaf (and hearing impaired) lip readers have to do since the stated 12% initial recognition is about right. They have to stay very focused on the speaker and make heavy use of context to work out what's being said. And that's a perfect job for a computer.

- Re: (Score:2)
  
  by BarbaraHudson ( 3785311 ) writes:
  
  The closed-captioning does speech-to-text, not lip reading. It's advanced to the point that you can dictate your SMS messages more reliably than fumbling around with an on-screen keyboard and auto-uncorrect.
  - Re: (Score:2)
    
    by ShanghaiBill ( 739463 ) writes:
    
    The closed-captioning does speech-to-text, not lip reading.
    Sure, but if it did both, the error rate would go way down.
    - Re: (Score:2)
      
      by jordanjay29 ( 1298951 ) writes:
      
      Sort of? Consider how many times dialogue is spoken off-camera, such as a voice-over or cutaway reaction shot, or when the speaker is simply not facing the camera. Your reliability in those cases is cut in half anyway without the advantage of being able to lip read.
      - Re: (Score:2)
        
        by JohnFen ( 1641097 ) writes:
        
        Also consider how frequently the captions differ from the actual spoken words.
        
        Re: (Score:2)
        
        by jordanjay29 ( 1298951 ) writes:
        
        This can happen for a number of reasons, actually. Sometimes it's an actual mistake, but also possible is a rephrasing of the line to make it easier to caption or easier to understand. Since captioning is most often geared towards Deaf people, and many grew up with English as a second language, some idioms and turns of phrase can seem out of context and aren't as appropriate for captions. There are some who bristle at this attempt at hand-holding and think captions should be 100% accurate to dialogue, while
        
        Re: (Score:2)
        
        by JohnFen ( 1641097 ) writes:
        
        Yes, I understand. But the fact that the captions and the spoken words often differ limits the effectiveness of combining captions and lip reading to reduce the error in machine translations. It doesn't matter much why the captions and the spoken words differ.
  - Re: (Score:2)
    
    by jordanjay29 ( 1298951 ) writes:
    
    What are you talking about? Closed captioning for most media is manually entered and synced with the time. Speech-to-text captions (like those on YouTube) have far less accuracy, although sometimes they put real-time captioning (think televised news) to shame. But most of what you see on TV and everything on DVDs is written and checked by a human, and is not entirely reliant on STT transcription.
    - Re: (Score:2)
      
      by BarbaraHudson ( 3785311 ) writes:
      
      Closed captioning for live events (such as news) is speech-to-text. Easily detectable if you read the captions and listen to the words - the mistakes aren't from typos, but from similar-sounding words. Manual entry also takes a few seconds' delay, same as simultaneous translation is not really simultaneous: there's a second or so delay (but the translator can often anticipate what's about to be said by context - and then when they goof, you get to hear it when they correct themselves).
  - Re: (Score:2)
    
    by v1 ( 525388 ) writes:
    
    The closed-captioning does speech-to-text, not lip reading.
    
    Closed Captioning is the transmission of text of what is being said along with the video and audio stream. It's up to the receiver to do text to speech.
    The benefit of CC here is that you have the "problem" (the video of the speaker) AND the "answer" (the text that they spoke) to work with, and this is precisely what you require to train a neural network. A large volume of problems and correct solutions. "When you get THIS input, you are suppose
    - Re: (Score:2)
      
      by BarbaraHudson ( 3785311 ) writes:
      
      There's no need to do text-to-speech if you're already transmitting the audio stream, duh!
- Re: (Score:2)
  
  by stephanruby ( 542433 ) writes:
  
  Would each closed-captioned syllable or word need to be manually synchronized with the video first? Or can the training be done without it?
  Getting half the words correct, then feeding that into a grammar / context engine should yield very close to 100% accuracy.
  But this AI is already using context to some degree. The article gives the example of "Prime Minister" for instance, where the AI knows that if the word "Prime" is read on their lips, that the word "Minister" will probably follow. Also, the AI has been trained in one context alone, which means that the context is already taken into account. For instance, if the same anch
straight from the related links (Score:1)

by Anonymous Coward writes:

https://tech.slashdot.org/story/16/11/25/1146258/googles-deepmind-made-an-ai-watch-close-to-5000-videos-so-that-it-surpasses-humans-in-lip-reading?sdsrc=rel
But the wild card walks in (Score:1)

by Anonymous Coward writes:

Sees the computer AI progressing in its research, and decides to replace the movies being watched with the complete collection of Gojira monster films that were dubbed in English and hardly provided any syncing at all, circa 1960's era, followed by Chinese martial arts movies full of lines like "Yaaaaa!" "Huh?" and "Prepare to die!"
The icing on the cake is when he throws in an Inspector Clouseau film
The surveillance state (Score:4, Insightful)

by JustAnotherOldGuy ( 4145623 ) writes: on Saturday March 18, 2017 @11:59AM (#54065591) Journal

The surveillance state is coming in its pants thinking about all the additional conversations they'll be able to monitor now.
Time to break out the bandannas and cough-masks....soon it'll be fashionable to wear them in public!

- Re: (Score:3, Insightful)
  
  by fustakrakich ( 1673220 ) writes:
  
  soon it'll be fashionable to wear them in public!
  And illegal
That cry of dismay ... (Score:2)

by BarbaraHudson ( 3785311 ) writes:

That cry of dismay was the sound of thousands of blind gynecologists realizing they will be out of a job reading lips. :-)
Of course the reality is grim - even more surveillance by marketers and the state - especially with TVs and webcams and (if you believe Trump) microwaves watching everything you say and do.
  - Re: (Score:2)
    
    by BarbaraHudson ( 3785311 ) writes:
    
    if you believe Trump
    You should! He never lies, and he's always right
    Except when his lips move ....
- Re: (Score:2)
  
  by jordanjay29 ( 1298951 ) writes:
  
  The irony of your comment is that silent movies generally used title cards for their dialogue anyway, making it equally accessible no matter if you could hear or not.
Professional lip readers are bunk. (Score:3)

by Khyber ( 864651 ) writes: <techkitsune@gmail.com> on Saturday March 18, 2017 @12:19PM (#54065693) Homepage Journal

Go compare this to a deaf person that reads lips. I know of literally thousands that never miss a single spoken word as long as they're looking at the speaker's mouth.
Source: Camfrog, where there are fucktons of deaf people communicating with those with hearing. We speak after getting their attention with a hand signal, they read our lips and reply with zero issues.

- Re: (Score:2)
  
  by JohnFen ( 1641097 ) writes:
  
  This is true. I once had a conversation with someone and was very surprised to later learn that the person was completely deaf. I had no clue.
Based on "2001", I thought it would be better (Score:2)

by mykepredko ( 40154 ) writes:

Or was Frank Poole killed because HAL thought they were going to unplug the "Mammary Circus" and that was basically the only DVD the three of them could agree on watching?
need good info to train the AI (Score:2)

by frovingslosh ( 582462 ) writes:

I'm wondering what text they are using to train the AI about what was said. I sure hope it isn't the closed captioning text on the news broadcasts. In my experience that is only about 50% accurate itself.
Round peg, meet round hole (Score:4, Interesting)

by yodleboy ( 982200 ) writes: on Saturday March 18, 2017 @12:40PM (#54065791)

Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...

- Re: (Score:2)
  
  by BarbaraHudson ( 3785311 ) writes:
  
  Because Berkeley lied when they said that they had to provide transcripts or remove the material. Section 107 of the copyright act 1976 [copyright.gov] allows for fair use for teaching materials, and this allows 3rd parties to make available all such materials in more accessible forms, and for Berkeley to use the results of such work.
  They weren't interested in doing this. It's about monetization and artificial scarcity, pure and simple. This was just a smokescreen to remove the material.
  The blind will be using TTS screen
  - Re: (Score:2)
      
      by BarbaraHudson ( 3785311 ) writes:
      
      They could have pointed out that since these are fair use materials, there are agencies whose mandate is to make them ADA compliant. They didn't.
- Re: (Score:2)
  
  by Barnoid ( 263111 ) writes:
  
  Why don't they offer to run this against the thousands of hours of course videos that Berkeley just pulled due to ADA? Google gets massive training material, Berkeley gets free transcripts, and the material stays online. Everyone wins...
  Good idea, but unfortunately it won't work in this case. Many of UCBerkeley's lecture videos only show the slides and you hear the lecturer talk. See, for example, https://www.youtube.com/watch?... [youtube.com].
Learning through TV (Score:2)

by JohnFen ( 1641097 ) writes:

"As it keeps watching TV, it will learn."
When TV was first being introduced as a consumer product, one of the selling points of the idea was that people would be able to learn by watching it. If this works out as well as that, then the system will only be able to recognize when someone is uttering lines from commercials.
- Re: (Score:2)
  
  by jordanjay29 ( 1298951 ) writes:
  
  That accounts for the 50% rate, that's about how many commercials are captioned.
Spying Concerns (Score:1)

by ssufficool ( 1836898 ) writes:

At least I know it won't be able to read my lips. You see, I speak American, not English.
Not surprising (Score:2)

by nitehawk214 ( 222219 ) writes:

Humans are very difficult to read.
Try this line, Mr. AI lipreader (Score:2)

by cellocgw ( 617879 ) writes:

Did he just say "No new taxes," or did he say "No Newt[Gingrich] Axes" ?
Heck you were even told, prior to that line, "read my lips," so you got no excuses.
Duplicate, and old (Score:2)

by udif ( 32355 ) writes:

https://tech.slashdot.org/story/16/11/25/1146258/googles-deepmind-made-an-ai-watch-close-to-5000-videos-so-that-it-surpasses-humans-in-lip-reading
Quiet Man (Score:1)

by sfsp ( 655361 ) writes:

Maybe we'll finally find out what John Ford told Maureen O'Hara to say to John Wayne...a secret all three took to their graves...
