Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

I've used the faster-whisper WebUI with a VAD, the large-v2 model, and the author's default parameters. It gives the most accurate and detailed translation, but too many hallucinations.
WhisperJAV with the author's default parameters: some lines are missing (non-critical) and it is a little less accurate, but the author has done an excellent job of removing hallucinations.
 
I have an RTX 4090. Bought it for gaming, but it turns out the requirements for whisper are pretty much the same, so I decided to use it for that as well.

Currently I am not using any VAD, as I haven't found a good description of how to install one yet.
 

Wow. With that rig you can crank up the hyperparameters to their highest-quality settings. In that case I'd recommend playing with these values:

temperature (0.0, 0.2) (I don't like more than 0.2 temp as the model gets way too creative in making up stories :))
beam_size 10 (default is 5, some people have reported best results with up to 20)
best_of 10 (same as above)
patience 2

And of course, keep condition_on_previous_text set to False.
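
For anyone driving Whisper from Python rather than the command line, the values above could be passed roughly like this. This is only a sketch assuming the openai-whisper package; the helper names, the `translate` task, and the `ja` language setting are assumptions added for illustration, not something the post specifies.

```python
# Sketch: the suggested settings as keyword arguments for openai-whisper's
# model.transcribe(). Helper names are hypothetical.

def suggested_options():
    """Return the decoding options recommended in the post above."""
    return {
        "task": "translate",         # assumption: direct-to-English output
        "language": "ja",            # assumption: Japanese source audio
        "temperature": (0.0, 0.2),   # fallback schedule; nothing above 0.2
        "beam_size": 10,             # applies when decoding at temperature 0
        "best_of": 10,               # applies when sampling at temperature > 0
        "patience": 2,
        "condition_on_previous_text": False,
    }


def run(path, model_name="large-v2"):
    """Transcribe one file and return Whisper's list of segment dicts."""
    import whisper  # imported lazily so the options can be inspected without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **suggested_options())
    return result["segments"]

# Example use (needs a GPU and the model download):
#   for seg in run("movie.mp4"):
#       print(seg["start"], seg["end"], seg["text"])
```

Keeping the options in one function makes it easy to change a single value between test runs and compare the output.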

Enjoy 4090 :)
 
@mei2

Do you find any benefits of Transcribe > DeepL over straight Translation? I feel like Translation is fine for the main themes, but it fails on any sexy talk or any innuendo. Am thinking of trying Transcribe > DeepL if it can do a better job during the hotter parts of a movie.
 

Benefits of translate: it is faster (it is weird, but the model is faster at outputting translation than transcription, which is because of the way it was trained).

Benefits of transcribe -> DeepL: better quality.

In terms of naughty language: the model for the Whisper translation task is trained on YouTube videos, so clearly it is rather sanitised. To my taste, I like the DeepL route better. If you want a more customised style, pro users can define a custom glossary in DeepL, and that usually helps with style (say, more naughty language).
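
A rough sketch of the transcribe -> DeepL route, assuming the Japanese transcript is already saved as an SRT file. The function names are made up for illustration; `deepl.Translator` and `translate_text` come from the official `deepl` Python package, and the optional `glossary` argument is where a custom glossary would go.

```python
import re

# An SRT cue is: index line, timing line, then one or more text lines.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def translate_srt(srt_text, translate_fn):
    """Translate only the dialogue of an SRT file, keeping numbers and timings."""
    out = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) >= 3 and TIMING.match(lines[1].strip()):
            # Simplification: multi-line cues are joined into one string.
            text = " ".join(lines[2:])
            lines = lines[:2] + [translate_fn(text)]
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"


def deepl_translator(auth_key, glossary=None):
    """Build a translate_fn backed by DeepL (needs an API key)."""
    import deepl  # lazy import: only needed for the real service

    translator = deepl.Translator(auth_key)

    def translate(text):
        result = translator.translate_text(
            text, source_lang="JA", target_lang="EN-US", glossary=glossary
        )
        return result.text

    return translate

# Example use:
#   english = translate_srt(open("movie.ja.srt", encoding="utf-8").read(),
#                           deepl_translator("your-auth-key"))
```

Passing the translator in as a function means the SRT handling can be tried out with a dummy translator before spending any DeepL quota.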
 
beam_size 10 (default is 5, some people have reported best results with up to 20)
best_of 10 (same as above)
patience 2
I tried with those settings. Seems to mostly work, but I see some weird results. For one thing, there was this:
[Attached image: 1698269533222.png]

Not sure why it thought a bunch of dogs was a proper translation.

Then there was:
948
01:14:57,200 --> 01:15:27,200
I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break for a while, so I'm going to take a break

Oddly enough, this was while there was no talking at all, just some background music. Any idea what setting would get rid of that?

I guess I have more experimenting to do, though maybe I should get something like a 10 minute segment to try stuff out on, because every attempt to do the full movie does take a fair amount of time.
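
One way to cut a 10-minute test segment is with ffmpeg. The sketch below only builds and runs the command; the file names and the 30-minute start offset are placeholders, and it assumes ffmpeg is installed and on the PATH.

```python
import subprocess


def clip_command(src, dst, start="00:30:00", seconds=600):
    """Build an ffmpeg command that extracts a mono 16 kHz WAV test clip."""
    return [
        "ffmpeg", "-y",
        "-ss", start,         # seek to the start of the sample
        "-t", str(seconds),   # keep this many seconds (600 = 10 minutes)
        "-i", src,
        "-vn",                # drop the video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz, the sample rate Whisper works at
        dst,
    ]


def make_clip(src, dst, **kwargs):
    """Run ffmpeg and raise if it fails."""
    subprocess.run(clip_command(src, dst, **kwargs), check=True)

# Example use:
#   make_clip("movie.mp4", "sample.wav", start="01:10:00")
```

Picking a segment that contains both dialogue and a music-only stretch makes it easier to see whether a setting change reduces hallucinations.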
 

Yep, those are hallucinations. One can reduce them by dialing up the VAD threshold. For example, with Silero VAD a threshold of 0.8 or 0.9 would filter out many of those ambiguously non-speech segments. As for Whisper, as usual one has to keep condition_on_previous_text set to False and temperature at 0 to reduce hallucination. Ideally the no_speech_threshold should help, but unfortunately that implementation has not been reliable, as explained by one of the lead developers of Whisper.
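
To see what a higher threshold does before running Whisper at all, Silero VAD can be called on its own. A minimal sketch, assuming PyTorch and torchaudio are installed and the model is loaded through `torch.hub` as in the Silero README; the function names here are made up for illustration.

```python
def to_seconds(stamps, sampling_rate=16000):
    """Convert Silero's sample-index timestamps to (start, end) in seconds."""
    return [(s["start"] / sampling_rate, s["end"] / sampling_rate) for s in stamps]


def speech_regions(path, threshold=0.9, sampling_rate=16000):
    """Return the regions Silero VAD considers speech at the given threshold."""
    import torch  # lazy import; read_audio also needs torchaudio

    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _save_audio, read_audio, _vad_iter, _collect = utils
    wav = read_audio(path, sampling_rate=sampling_rate)
    stamps = get_speech_timestamps(
        wav, model, threshold=threshold, sampling_rate=sampling_rate
    )
    return to_seconds(stamps, sampling_rate)

# Example use: compare how much audio survives at two thresholds.
#   for t in (0.5, 0.9):
#       regions = speech_regions("sample.wav", threshold=t)
#       print(t, sum(end - start for start, end in regions))
```

If a music-only stretch still shows up as speech at 0.9, the VAD alone will not remove that hallucination and postprocessing is still needed.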

Of course, reducing hallucination has the side effect that one misses some details. There is always a trade-off.

For me, I prefer details. So what I do is let the hallucinations through and remove them in postprocessing, although there are always some surprises. Like, I hadn't seen the walking dog :)
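
As an illustration of what such postprocessing might look like (this is a simple sketch of one possible approach, not WhisperJAV's actual method), a phrase repeated several times in a row can be collapsed with a back-referencing regular expression. The function names and the 0.5 ratio are assumptions.

```python
import re

# A phrase of at least 5 characters followed by 2 or more exact copies of itself.
REPEAT = re.compile(r"(.{5,}?)\1{2,}", re.DOTALL)


def collapse_repeats(text):
    """Collapse a phrase repeated 3+ times in a row into one occurrence."""
    return REPEAT.sub(r"\1", text)


def looks_hallucinated(text, max_ratio=0.5):
    """Flag a cue that shrinks to less than max_ratio of its length when collapsed."""
    if not text:
        return False
    return len(collapse_repeats(text)) < len(text) * max_ratio

# Example use on a looping cue:
#   cue = "I'm going to take a break for a while, so " * 5
#   looks_hallucinated(cue)   # True, so the cue can be dropped or reviewed
```

Flagged cues can either be dropped outright or kept in collapsed form for manual review, depending on how much detail one wants to preserve.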
 
I guess I can live with the hallucinations if it means more details as well.

I do need to do something about the occasional very long line, as a few times I ended up with subtitles covering the entire screen, but those are rare and easily fixed manually.
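
For the occasional screen-filling line, a small script could do the wrapping instead of a manual fix. A sketch using only the standard library; the 42-character width and two-line limit are assumed subtitle conventions, and splitting the cue's timing across the resulting chunks is left out.

```python
import textwrap


def wrap_cue(text, width=42, max_lines=2):
    """Split one long subtitle into chunks of at most max_lines wrapped lines."""
    lines = textwrap.wrap(text, width=width)
    return [
        "\n".join(lines[i:i + max_lines])
        for i in range(0, len(lines), max_lines)
    ]

# Example use:
#   for chunk in wrap_cue(long_text):
#       print(chunk)
#       print()
```

Each returned chunk would become its own cue, with the original time span divided between them.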
 
FWIW, I have fixed a few bugs and released version 0.6b of WhisperJAV. This version has better hallucination removal.


WhisperJAV Version 06b
Hey thanks! Love your work so great to see another update!

Bit of a weird request but is there any chance of a potential Runpod Template? - https://www.runpod.io/

For me, I'm fine paying a little for efficient/fast translations, but Colab premium seems a mess... no idea how "computation" units work...
 
Actually, forget all that, your Colab is lightning quick! It did 3 full-length movies in 20 min, if that! Amazing :)
 
Off-topic: does anyone have a recommendation for torrent client? I'm using qBittorrent and am not sure about the speed.
 
qBittorrent should be good. It's based on libtorrent, which is still actively developed, so if you're unhappy with your speed you would probably be better off optimizing the settings for your usage.

With that said, my internet connection is slow, so that's not something I ever had a problem with, since it can easily max it out. I do most of my torrenting on a headless server, so the most obvious option for me is rtorrent, since it has a command-line interface. I do use qBittorrent in some special cases when rtorrent won't do, but it's pretty rare.
 
There is now a large-v3 model out for Whisper, and I think it is a large improvement over v2. The model scorecards say it has ~15% improvement to Japanese but I think that the improvement is larger than that for JAV. The model is different enough that settings will have to change.

I am using whisper-ctranslate2. vad_threshold of 0.3 and repetition_penalty of 1.1 seem like a good start. Patience of 3 or so, Beam Size of 8 or 10. Temperature 0.2, Best Of 10 or so.
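
Since whisper-ctranslate2 is built on faster-whisper, roughly the same settings can be tried from Python. A sketch assuming the faster-whisper package and a CUDA GPU; the helper names are made up, and picking 10 for beam size and best-of is just one choice from the ranges mentioned above.

```python
def large_v3_options():
    """Starting-point settings from the post, as faster-whisper keyword arguments."""
    return {
        "task": "translate",
        "language": "ja",             # assumption: Japanese source audio
        "beam_size": 10,
        "best_of": 10,
        "patience": 3,
        "temperature": 0.2,
        "repetition_penalty": 1.1,
        "vad_filter": True,           # built-in Silero VAD
        "vad_parameters": {"threshold": 0.3},
    }


def translate(path, model_name="large-v3"):
    """Return (start, end, text) tuples for one file."""
    from faster_whisper import WhisperModel  # lazy import

    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, **large_v3_options())
    # segments is a generator; iterating it is what runs the decoding.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]

# Example use:
#   for start, end, text in translate("sample.wav"):
#       print(f"{start:8.1f} {end:8.1f} {text}")
```

As far as I understand, with a single temperature above 0 the decoder samples `best_of` candidates, while `beam_size` and `patience` only take effect at temperature 0, so it may be worth testing both.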

Still benefits from editing but plot-heavy Attackers movies get way more coherent results in the first pass.
 
I'm using regular whisper and no vad, as I haven't researched how to install that yet, and I can't say I see that much difference between large-v2 and large-v3.

Granted, I've tried only one movie, so maybe the difference is bigger in different circumstances.
 
@panop857 do you use the transcribe task or translate? There are quite a few comments about large-v3 creating an overwhelming amount of hallucination. What has your experience been? Thanks!
 
Translate. I think you get better results doing direct-to-english rather than transcribing to Japanese and then translating with some other method. It is like using Latent upscaling in Stable Diffusion getting way better results than upscaling from an otherwise final image.

A repetition_penalty of 1.1 or 1.2 or even 1.3 can really crack down on hallucination. Basically, it helps signal to the model that if it is selecting the same solution for subsequent chunks, it is doing something wrong and needs to rethink what it is offering. Hopefully that means just having silence, but with 2hr movies it is hard to find settings that are good for the entire film at the same time.
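
For the curious, here is a toy sketch of how a repetition penalty is commonly applied during decoding. It assumes the usual formulation, where the scores of tokens that were already generated are pushed down before the next token is chosen; the exact details inside faster-whisper/CTranslate2 may differ, and the function name is made up.

```python
def apply_repetition_penalty(scores, generated_ids, penalty=1.1):
    """Lower the score of every token id that has already been generated."""
    out = list(scores)
    for tok in set(generated_ids):
        s = out[tok]
        # Positive scores are divided and negative ones multiplied,
        # so both move in the "less likely" direction when penalty > 1.
        out[tok] = s / penalty if s > 0 else s * penalty
    return out

# Example use: tokens 0 and 1 were already emitted, token 2 was not.
#   apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0)
```

In this formulation the penalty acts on individual tokens within the decoded text, which would explain why a value as small as 1.1 already discourages looping phrases.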

There are still problems with it repeating the same three-line chunks 2x or 3x in a row, and I have not found good settings that suppress that.

I think setting condition_on_previous to True is really important, but that biases the results in a way that can lead to hallucination, which makes the repetition_penalty extra important when using conditioning.
 
Are condition_on_previous and repetition_penalty vad things? I don't see them listed as an option on whisper itself.

While just whisper is enough to follow the story, especially once you get used to the standard mistakes it makes a lot, I'm wondering how much better VAD makes the translation. Is it worth figuring out how to install that on my computer, or doesn't it improve the translation much?
 

condition_on_previous_text is an option in command-line Whisper.

VAD is a huge improvement because it is better at identifying when a longer thought is being expressed and translates the whole thing together. This makes it better at picking up on meanings and not just literal translation of expressions.

I use whisper-ctranslate2 in a terminal on Windows. Faster Whisper I think incorporates VAD, and that is what whisper-ctranslate2 uses.
 

Good choice to go with whisper-ctranslate2. Especially now that the original developer of faster-whisper has left to join Apple, I've been wondering what would be a good branch to move to that is still actively being developed. I'm guessing that the contributors to whisper-ctranslate2 will continue to maintain the new (systran) branch of faster-whisper. I'll give whisper-ctranslate2 a try.

On a pedantic :) note: the option for repetition_penalty is a feature of faster-whisper, not present in original Whisper.
 