
FFmpeg 8 Can Now Subtitle Your Videos on the Fly (theregister.com) 32
FFmpeg 8.0 brings GPU-accelerated video encoding via Vulkan -- and can now subtitle your videos automatically using integrated speech recognition. From a report: At the start of the week, the FFmpeg project released its eighth major version. It's codenamed "Huffman" after the Huffman code algorithm, which was invented in 1952, making it one of the oldest lossless compression algorithms.
[...] The changelog lists 30 significant changes, of which the top new feature is integrating Whisper. This means whisper.cpp, which is Georgi Gerganov's entirely local and offline version of OpenAI's Whisper automatic speech recognition model. The bottom line is that FFmpeg can now automatically subtitle videos for you.
Shit (Score:1)
Looks like ffmpeg is the latest enshittification victim.
Re:Shit (Score:5, Interesting)
At least it's local and offline.
Re:Shit (Score:5, Interesting)
Not entirely.
Whisper actually works rather well in several specific use cases, and fails spectacularly in others. You need to know this in advance:
- Whisper is roughly 90% accurate at transcription and translation
- Whisper absolutely does not know what to do with silence and will randomly inject "subtitled by (fansub group, netflix, etc)" into silence
- Whisper does not really understand singing well
- Whisper does not understand code-switching (eg switching between English and Japanese in the same context window)
- Whisper understands zero onomatopoeia, just like all ASR systems.
With that said, it is not useful or reliable for:
1. Fansubbing, especially anything adult. It can only understand words, not onomatopoeia. So when it stumbles into a scene where someone goes "ah!" it has zero context for it. The result is actually pretty silly, and often turns sex scenes in R-rated and unrated media into a series of random gibberish words that begin with the same sound. Likewise, children playing and women giggling often turn into a series of nonsense, sometimes sexually charged, words.
2. Transcription of podcasts. Sorry bub, your average podcaster has a shitty microphone, and Whisper cannot subtitle when multiple people are speaking over each other, especially when people use Zoom or Discord for a multi-party video call. If you want to use it to transcribe a podcast, record each participant separately and merge the results.
3. Anything with less than clean audio. ASR technology is often built on a corpus of bad data that elevates profanity when it tries to guess words it cannot understand. So it is more likely to produce racist language: "trigger" becomes the same word with an n, which isn't even in the audio. Your input source must be professional grade, or its word error rate will be higher and it will favor profanity or racist language over less frequent but more obvious words.
I doubt most people will use this in practice, as whisper.cpp is insanely slow unless it is run on a 16GB NVIDIA GPU anyway.
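The "record each participant separately and merge" advice for podcasts can be sketched roughly as follows. This is a minimal illustration in Python, not anything FFmpeg or whisper.cpp provides: it assumes each speaker's track has already been transcribed into (start, end, text) tuples in seconds, and the names `Segment`, `merge_transcripts`, `srt_time` and `to_srt` are made up for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def merge_transcripts(per_speaker):
    """Interleave per-speaker transcripts into one timeline.

    per_speaker maps a speaker name to a list of (start, end, text) tuples.
    """
    streams = []
    for speaker, segs in per_speaker.items():
        # Sort each stream so heapq.merge sees ordered inputs.
        ordered = sorted(segs, key=lambda seg: seg[0])
        streams.append([Segment(s, e, speaker, t) for (s, e, t) in ordered])
    return list(heapq.merge(*streams, key=lambda seg: (seg.start, seg.speaker)))


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def to_srt(segments):
    """Render merged segments as SRT text, prefixing each cue with the speaker."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
            f"{seg.speaker}: {seg.text}\n"
        )
    return "\n".join(blocks)
```

Overlapping speech simply becomes overlapping cues here; whether a player renders those sensibly is a separate question.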
Re: (Score:2, Interesting)
The points you mention sound like drawbacks of the available language models, not of the whisper library being used.
Re: Shit (Score:1)
Re: Shit (Score:2)
Re: (Score:2)
I am still confused. I can look up the real meaning, but what is the common misconception meaning?
Re: (Score:2)
"- Whisper absolutely does not know what to do with silence and will randomly inject "subtitled by (fansub group, netflix, etc)" into silence"
While this may be a design problem with Whisper, it should be easy to avoid in ffmpeg. If silence is detected, do not generate subtitles. Not the scientific solution, but a working one.
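The "if silence is detected, do not generate subtitles" idea could look something like the sketch below. It is a post-processing illustration only, not how FFmpeg or Whisper handle silence: it assumes the audio is available as a list of float samples in the range -1.0 to 1.0, that subtitle segments are (start, end, text) tuples in seconds, and that a fixed RMS threshold of 0.01 is a reasonable cut-off, which would need tuning for real recordings. The function names are made up for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a run of samples (0.0 for an empty run)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def drop_silent_segments(segments, samples, sample_rate, threshold=0.01):
    """Keep only the subtitle segments whose audio is louder than threshold.

    segments: list of (start_seconds, end_seconds, text)
    samples: mono audio as floats in [-1.0, 1.0]
    """
    kept = []
    for start, end, text in segments:
        lo = int(start * sample_rate)
        hi = int(end * sample_rate)
        # A hallucinated credit line over silence has near-zero energy.
        if rms(samples[lo:hi]) >= threshold:
            kept.append((start, end, text))
    return kept
```

A plain energy gate like this will also drop very quiet speech and keep loud non-speech noise, so it is the "working, not scientific" solution the comment describes.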
Re:Shit (Score:5, Informative)
I wrote this article.
I don't think so, no. It's a local feature, not online, entirely optional, and you are perfectly free to ignore it, not turn it on, and use FFmpeg as before.
The size of the binary of FFmpeg is a rounding error compared to the many gigabytes of the video files it takes as input and emits. If you do not enable the Whisper model I am not even sure it'll take any additional memory at runtime.
Re: (Score:3)
If people want to develop free and open source AI, it's better than just leaving it to self-interested corporations. Provided it's not forced on people.
If it is as good as Youtube, I'll pass (Score:5, Insightful)
Youtube's automatic subtitling is a piece of junk.
Re: (Score:2)
Youtube's automatic subtitling is a piece of junk.
Yeah, it really is. I'd have modded you up, but no points.
Re: (Score:2)
It makes many mistakes, perhaps too many, and that can't be denied. But it can still be a lifesaver for people with poor hearing or a poor command of spoken English.
Re: (Score:3)
Re: If it is as good as Youtube, I'll pass (Score:3)
Re: (Score:2)
Re: (Score:2)
Depends on the speaker.
Not on Wayland (Score:1)
It's gotta be better than live sports subtitles (Score:2)
I wish they'd go ahead and switch to some kind of automated subtitles already. The human subtitlers do an amazing job for a human, but they often get several sentences behind what's actually going on. If AI can subtitle live events, keeping the words on the screen in sync with what's being said, I'd welcome that even if it got a few more words wrong (which I doubt would happen, the humans get a lot of words wrong, and miss a lot as well).
Re: It's gotta be better than live sports subtitle (Score:4, Informative)
Well, unless it is Netflix. They clearly wanted it done cheap. Spelling mistakes, sloppy translations,... It is rushed work.
Re: (Score:2)
Indeed. As my subject line noted, I'm referring to live events. In the US, subtitles are also very good for prerecorded content.
Re: (Score:2)
Start by making the actual ones work (Score:2)
Yesterday I even tried merging avi and srt into an mkv and even that wouldn't display the subtitle. WTF ?!?
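For reference, a commonly used way to mux an AVI and an SRT into an MKV looks roughly like the sketch below, written as a small Python wrapper around the ffmpeg command line. The file names and the helper names `mux_command` and `mux` are placeholders, and this is only a guess at what the poster was attempting. If the subtitle stream is present in the output but still not shown, the player may simply not be selecting the track by default.

```python
import subprocess


def mux_command(video, subtitles, output):
    """Build an ffmpeg command that copies the A/V streams and adds an SRT track."""
    return [
        "ffmpeg",
        "-i", video,        # input 0: the AVI with audio and video
        "-i", subtitles,    # input 1: the SRT file
        "-map", "0",        # keep every stream from the video file
        "-map", "1",        # and the subtitle stream from the SRT
        "-c", "copy",       # do not re-encode audio or video
        "-c:s", "srt",      # store the subtitles as SRT inside the MKV
        output,
    ]


def mux(video, subtitles, output):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(mux_command(video, subtitles, output), check=True)
```

Calling `mux("movie.avi", "movie.srt", "movie.mkv")` would run the equivalent of `ffmpeg -i movie.avi -i movie.srt -map 0 -map 1 -c copy -c:s srt movie.mkv`.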
memory? (Score:2)
How big is this new AI monstrosity? A few terabytes maybe?
Re: (Score:2)
What did you do to find out? I mean you could for example try Google, Perplexity, maybe even ChatGPT could be asked about the size of the Whisper model.