OpenAI Open-Sources Whisper, a Multilingual Speech Recognition System (techcrunch.com)

Speech recognition remains a challenging problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables "robust" transcription in multiple languages, as well as translation from those languages into English. TechCrunch reports: Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and "multitask" data collected from the web, which led to improved recognition of unique accents, background noise and technical jargon.

"The primary intended users of [the Whisper] models are AI researchers studying robustness, generalization, capabilities, biases and constraints of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition," OpenAI wrote in the GitHub repo for Whisper, from where several versions of the system can be downloaded. "[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities ... if fine-tuned on certain tasks like voice activity detection, speaker classification or speaker diarization but have not been robustly evaluated in these area."

Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of "noisy" data, OpenAI cautions that Whisper may include words in its transcriptions that weren't actually spoken -- possibly because it's simultaneously trying to predict the next word in the audio and to transcribe the audio itself. Moreover, Whisper doesn't perform equally well across languages, suffering from higher error rates for speakers of languages that aren't well-represented in the training data. Despite this, OpenAI sees Whisper's transcription capabilities being used to improve existing accessibility tools.
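For those who want to poke at it, the Whisper repo ships a pip-installable Python package alongside the downloadable checkpoints. Here is a minimal sketch of the transcription and translation calls, assuming the package is installed per the repo's README (the model size and audio filename are placeholders):

    import whisper

    # Downloads the checkpoint on first use, then loads it.
    model = whisper.load_model("small")

    # Transcribe the audio in its spoken language...
    result = model.transcribe("audio.flac")
    print(result["text"])

    # ...or translate the speech into English instead.
    result = model.transcribe("audio.flac", task="translate", language="Japanese")
    print(result["text"])

Besides the flat text, the result also includes timestamped segments, which subtitle-style output can be built from.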


Comments:
  • by war4peace ( 1628283 ) on Sunday September 25, 2022 @06:14AM (#62911895)

    I'd love to have access to a demo of sorts, where you could feed it a 10-second audio clip and see how well it fares at transcribing it.

  • Just call it the 'universal translator'? 'Course they'd have to use the greatest computer voice ever, Majel Barrett

    (Yeah, I know the Star Trek version put it in the speaker's own voice, but that's beyond our tech.)

  • Finally! (Score:4, Funny)

    by devslash0 ( 4203435 ) on Sunday September 25, 2022 @08:34AM (#62912007)

    At long last, I'll finally be able to understand India-based customer support.

    • It just takes practice, like any other accent. For a few years, I worked with a number of Indians at a large multi-national, and it got to be second nature for me. Amusingly, when my wife had a minor health scare - not the amusing part - her doctor was Indian, and I was able to follow along with everything he said. I looked at my wife to see if she understood what she'd been told, and she had a blank look on her face. I had to translate for her, which is freakin' hilarious given that I have to use a coc...
      • That's another reason why I never go to Indian professionals for health, legal, or other important matters. I simply can't understand them, and when it comes to important things in life, understanding and details are key.

      • Also, have you ever been afraid that you may be speaking like them? It's not uncommon for people's accents to shift over time when they're regularly exposed to differently-sounding speech. Your wife may wake up one day and realise she's got a new Asian husband. Or you'll start talking in an Indian accent in your dreams, and it'll be a little bit like "Dr Jekyll and Mr Hyde".

  • Trial Run (Score:5, Informative)

    by walkerp1 ( 523460 ) on Sunday September 25, 2022 @09:54AM (#62912085)
    I grabbed this to take a look. It took me about 20 minutes to completely install. Consider that I also installed git, gh, ffmpeg, and Python 3.10, as well as all the prerequisites in whisper.git. For my test, I grabbed episode 49 of the Hikaru no Go anime that I had lying around and trimmed out the first 30 seconds of dialog.

    ffmpeg -ss 0:1:40 -to 0:2:10 -i HikaruNoGo_49.mkv -map 0:a:0 -acodec copy test.flac
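    (For the curious: -ss and -to select the 1:40-2:10 window, -map 0:a:0 picks the first audio stream, and -acodec copy writes that stream out as-is rather than re-encoding it.)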

    Then I decided to give it a translation run to compare with the provided subtitles:

    whisper --model small --task translate --language Japanese test.flac

    I have a pretty new laptop, and the spectrogram creation took about 14 minutes and 460 MB of space. I had first tried the large model, but it wanted an hour and a half plus 2+ GB of space, so I killed that for now, hoping that the more resource-gentle small model would prove out well. I'm guessing the actual translation took only a small fraction of the time relative to the transcription; that portion of the process took about 4 minutes. Next is the list of results, with Whisper's translation first and the provided subtitle second:

    Let me...
    Let me...

    Let me win, Hikaru!
    Please let me play, Hikaru!

    N-No way!
    You've got to be kidding!

    If I let you win, I'm sure you'll win!
    If I let you play, you're going to win for sure!

    Shindou...
    Shindou...

    You've finally come this far.
    You've finally come all this way.

    I've been waiting for this moment.
    I've been waiting for this.

    Hande? (Note that this word was spoken in English, which no doubt confused the translator)
    A handicap?

    If you let Hande win, the way you shoot will change.
    He'll have to play differently if he has a handicap.

    I'll shoot you with that!
    I'll play with one!

    As you can see, even the small model worked out quite well (in my estimation). This looks like it'll be a fun tool to play around with.
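    The same run can also be scripted through the package's Python API, and the result carries timestamped segments, so you can write out your own subtitle track to diff against the official subs. Here's a rough sketch along those lines, reusing the filenames from the run above (untested, so treat it as a starting point rather than a recipe):

        import whisper

        def fmt(t: float) -> str:
            # Seconds -> SRT timestamp, e.g. 100.0 -> "00:01:40,000".
            h, rem = divmod(int(t), 3600)
            m, s = divmod(rem, 60)
            ms = int((t - int(t)) * 1000)
            return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

        model = whisper.load_model("small")
        result = model.transcribe("test.flac", task="translate", language="Japanese")

        # Each segment carries start/end times (in seconds) plus its text.
        with open("test.srt", "w", encoding="utf-8") as srt:
            for i, seg in enumerate(result["segments"], start=1):
                srt.write(f"{i}\n")
                srt.write(f"{fmt(seg['start'])} --> {fmt(seg['end'])}\n")
                srt.write(seg["text"].strip() + "\n\n")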
  • What is mostly unsolved, and will remain mostly unsolved for a long time (and possibly forever), is speech understanding. It turns out that speech is not very redundant at the signal level, and is impossible to recognize accurately without understanding the context, and often the speakers as well. Machines cannot do that, and no still-feasible increase in the size of the statistical models will really help.

    • > Machines cannot do that

      People can't do that either. Not all of them, anyway. Just some.
      • by gweihir ( 88907 )

        True. But many people are not capable of understanding language in general either. They can only understand what fits their specific, often deeply flawed model of the world.

  • As someone who has to deal with Indian colleagues, I hope this technology will someday let me understand what the hell they're trying to say.

"I've seen it. It's rubbish." -- Marvin the Paranoid Android

Working...