OpenAI Open-Sources Whisper, a Multilingual Speech Recognition System (techcrunch.com)
Speech recognition remains a challenging problem in AI and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables "robust" transcription in multiple languages as well as translation from those languages into English. TechCrunch reports: Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and "multitask" data collected from the web, which led to improved recognition of unique accents, background noise and technical jargon.
"The primary intended users of [the Whisper] models are AI researchers studying robustness, generalization, capabilities, biases and constraints of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition," OpenAI wrote in the GitHub repo for Whisper, from where several versions of the system can be downloaded. "[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities ... if fine-tuned on certain tasks like voice activity detection, speaker classification or speaker diarization but have not been robustly evaluated in these area."
Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of "noisy" data, OpenAI cautions Whisper might include words in its transcriptions that weren't actually spoken -- possibly because it's both trying to predict the next word in audio and trying to transcribe the audio itself. Moreover, Whisper doesn't perform equally well across languages, suffering from a higher error rate when it comes to speakers of languages that aren't well-represented in the training data. Despite this, OpenAI sees Whisper's transcription capabilities being used to improve existing accessibility tools.
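For developers who want to try it locally, the repo also documents a Python API. A minimal transcription sketch, assuming you've installed the whisper package and ffmpeg per the repo's instructions (the audio filename here is just a placeholder):

import whisper

# Load one of the released checkpoints (tiny, base, small, medium, large);
# smaller checkpoints run faster but make more mistakes.
model = whisper.load_model("small")

# Transcribe a local audio file; Whisper auto-detects the language.
result = model.transcribe("audio.flac")
print(result["text"])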
"The primary intended users of [the Whisper] models are AI researchers studying robustness, generalization, capabilities, biases and constraints of the current model. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition," OpenAI wrote in the GitHub repo for Whisper, from where several versions of the system can be downloaded. "[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities ... if fine-tuned on certain tasks like voice activity detection, speaker classification or speaker diarization but have not been robustly evaluated in these area."
Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of "noisy" data, OpenAI cautions Whisper might include words in its transcriptions that weren't actually spoken -- possibly because it's both trying to predict the next word in audio and trying to transcribe the audio itself. Moreover, Whisper doesn't perform equally well across languages, suffering from a higher error rate when it comes to speakers of languages that aren't well-represented in the training data. Despite this, OpenAI sees Whisper's transcription capabilities being used to improve existing accessibility tools.
Hard to test (Score:3)
I'd love to have access to a demo of sorts, where you could input a 10-second audio clip and see how well it fares at transcribing it.
Re: (Score:3, Informative)
https://whisper-openai.vercel.... [vercel.app]
Re: (Score:2)
But I don't need to record my own voice (I'll obviously know what I am saying).
Uploading a short audio file is what I was looking for.
Scottish Medical Specialists (Score:3)
Can they (Score:2)
just call it the 'universal translator'? Course they'd have to use the greatest computer voice ever, Majel Barrett.
(Yeah, I know the Star Trek version put it in the speaker's own voice, but that's beyond our tech.)
Finally! (Score:4, Funny)
At long last I'll be able to understand India-based customer support.
Re: (Score:2)
That's another reason why I never go to Indian professionals for health, legal, or other important matters. I simply can't understand them, and when it comes to important things in life, understanding and details are key.
Re: (Score:2)
Also, have you ever been afraid that you may start speaking like them? It's not uncommon for people to change their accent over time when regularly exposed to differently-sounding people. Your wife may wake up one day and realise she's got a new Asian husband. Or you'll start talking in an Indian accent in your dreams and it's going to be a little bit like "Dr Jekyll and Mr Hyde".
Trial Run (Score:5, Informative)
First, I extracted a 30-second audio clip from an episode of Hikaru no Go:
ffmpeg -ss 0:1:40 -to 0:2:10 -i HikaruNoGo_49.mkv -map 0:a:0 -acodec copy test.flac
Then I decided to give it a translation run to compare with the provided subtitles:
whisper --model small --task translate --language Japanese test.flac
I have a pretty new laptop, and the spectrogram creation took about 14 minutes and 460 MB of space. I had first tried the large model, but it wanted an hour and a half plus 2+ GB of space, so I killed that for now, hoping that the more resource-gentle small model would prove out well. My guess is that the actual translation took only a small fraction of the time relative to the transcription; that portion of the process took about 4 minutes. Below, each Whisper translation is paired with the corresponding official subtitle:
Whisper: Let me...
Subtitle: Let me...

Whisper: Let me win, Hikaru!
Subtitle: Please let me play, Hikaru!

Whisper: N-No way!
Subtitle: You've got to be kidding!

Whisper: If I let you win, I'm sure you'll win!
Subtitle: If I let you play, you're going to win for sure!

Whisper: Shindou...
Subtitle: Shindou...

Whisper: You've finally come this far.
Subtitle: You've finally come all this way.

Whisper: I've been waiting for this moment.
Subtitle: I've been waiting for this.

Whisper: Hande? (Note that this word was spoken in English, which no doubt confused the translator)
Subtitle: A handicap?

Whisper: If you let Hande win, the way you shoot will change.
Subtitle: He'll have to play differently if he has a handicap.

Whisper: I'll shoot you with that!
Subtitle: I'll play with one!
As you can see, even the small model worked out quite well (in my estimation). This looks like it'll be a fun tool to play around with.
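If you'd rather script the whole extract-and-translate loop than run the two commands by hand, the same workflow in Python looks roughly like this, using whisper's Python API (the filenames and timestamps are the ones from my test above):

import subprocess
import whisper

# Cut the same 30-second audio clip out of the episode with ffmpeg.
subprocess.run([
    "ffmpeg", "-ss", "0:1:40", "-to", "0:2:10",
    "-i", "HikaruNoGo_49.mkv", "-map", "0:a:0",
    "-acodec", "copy", "test.flac",
], check=True)

# Translate the Japanese speech to English with the small model.
model = whisper.load_model("small")
result = model.transcribe("test.flac", task="translate", language="Japanese")
print(result["text"])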
Actually speech recognition is easy (Score:2)
What is mostly unsolved, and will remain mostly unsolved for a long time (and possibly forever), is speech understanding. It turns out that speech is not very redundant at the signal level and is impossible to recognize accurately without understanding the context, and often the people speaking, as well. Machines cannot do that, and no feasible increase in the size of the statistical models will really help.
Re: (Score:2)
People can't do that either, at least not all of them. Just some.
Re: (Score:2)
True. But many people are not capable of understanding language in general either. They can only understand what fits their specific, often deeply flawed model of the world.