Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

panop857

OpenAI made massive waves in 2022 when they released DALL-E 2 and ChatGPT, but that year they also released Whisper for Automatic Speech Recognition, meaning transcripts. Not just transcripts, but direct-to-English transcripts from any of 97 different languages, including Japanese. In practice, these are a high-water mark for auto transcriptions, and will only get better with time like most of these AI techniques.

What this means is that you (yes, YOU) can, without collaborators, generate and edit English subtitles for your favorite JAV.

Some kind souls have put up free portals, either on Google Colab or running on their own spare GPU cycles. Otherwise, you will need to use Python or the command line.

I'll update this post and the FAQ based on feedback and discussion. Better to have an early version of this out than to sit on it forever trying to perfect it.

How it Works

Like the other recent AI models, this is a giant Neural Network (NN) trained on a huge amount of diverse data. There are a bunch of moving parts, but here is what sets it apart from other transcription tools, along with some practical notes:
  1. The model is trained on 680,000 hours of audio across 97 different languages, split up into 30 second chunks.
  2. Translates directly from the source language to English subtitles, leading to some better interpretations.
  3. Easy to set up in Python for either CPU or GPU. Get Anaconda; the command line from inside the Anaconda Launcher is going to be sufficient even for people with limited programming experience.
  4. Various online portals of people loaning out their spare GPU time so you can do Medium or Large models.
  5. Small produces reasonable results off a CPU, and the Medium model runs off of semi-recent GPUs. The Large model is too large to run for even avid gamers, and there is no way to cheat the hard VRAM requirements.
  6. Built-in Voice Detection systems and timing. The Google Colab that people use implements a more consistent Voice Detection system, which likely also leads to better subtitle timings.
  7. The basic flow of Whisper is that it attempts to identify single speaking segments of a single speaker, and attempts to determine the most likely transcription of that dialogue chunk. It assigns probabilities to a whole bunch of different possibilities, and then chooses the one that seems to be the most likely. Among those probabilities is a prediction for "this segment isn't actually speech", and many of the parameters you choose for the model determine what it does with these probabilities.
  8. While you can technically throw a full film at it as the target file to transcribe/translate, it is better to extract just the audio into a separate 100MB-200MB file. Use MKVToolNix or FFmpeg.
  9. Can be run via Python script, notebook, or command line. I prefer command line. Notebooks will have problems with freeing up the VRAM.
  10. Japanese is one of the better languages for this by standard transcription and translation metrics, but dirtiness and pronouns will be a frequent problem to work around.
  11. SubtitleEdit allows for direct generation of subtitles without having to learn Python. This is convenient, because SubtitleEdit is also the best tool for editing and revising the generated subtitles from Whisper.
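
For item 8, here is a minimal sketch of pulling the audio out with FFmpeg from a Python script. The file names are placeholders, and the mono / 16 kHz settings are my assumption of a sensible target (as far as I know Whisper resamples to that internally anyway), not something the guide above prescribes.

```python
import shutil
import subprocess


def build_ffmpeg_audio_command(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that keeps only the audio track of a video."""
    return [
        "ffmpeg",
        "-i", video_path,  # input film (placeholder name)
        "-vn",             # drop the video stream
        "-ac", "1",        # downmix to mono (assumed target)
        "-ar", "16000",    # resample to 16 kHz (assumed target)
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run the command, but only if ffmpeg is actually installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_ffmpeg_audio_command(video_path, audio_path), check=True)


# Show the command that would be run, without running it.
print(" ".join(build_ffmpeg_audio_command("movie.mkv", "movie.wav")))
```

Call extract_audio("movie.mkv", "movie.wav") to actually run it, then point Whisper at the .wav instead of the full film.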


Strengths of Whisper

⦁ Makes it feasible for one person to generate subtitle files for what used to be a multi-person job. One person, with Whisper, can produce timed subtitle files in English from any language.
⦁ Does a remarkably good job at discerning Japanese names and plot details. For something like Attackers movies, this is one of the things that people are most interested in, and this ends up being comparable to or better than most of the subs you'll find online.
⦁ The subs can be a lot more complete. Most JAV subs just don't bother translating a lot of the lesser text, but if you're calibrating the detection thresholds, you will have far better coverage of speech than you're used to seeing.
⦁ Easily outclasses the English subtitles based on Chinese hardcoded subtitles that have been flooding the web for the last two years.

Weaknesses of Whisper

⦁ While there are undoubtedly going to be many sites in the future that use Whisper transcriptions without editing, a single editing pass is going to greatly improve the quality and readability. It is easy to catch badly interpreted lines when doing a test run.
⦁ Whisper is bad at "talking dirty". There are likely ways to fix this, but it isn't hard to see why the training data sets might veer away from filthy language.
⦁ For Japanese-to-English transcriptions, the models that can run on a CPU really don't cut it. You need a semi-recent GPU to be able to run the Medium model, and the Large model that produces by far the best results is just totally out of the price range for nearly all casual users. As such, most of my advice in this thread is going to be for maximizing the quality of the Medium models.
⦁ Pronouns are genuinely difficult to get right when translating from Japanese. This is something that can mostly be corrected during the editing watch.
⦁ For some of the tweaking parameters it is difficult to intuit what a good value is, and the best parameters may differ substantially between a new movie and something from a decade ago.



Parameter Tuning

Whisper has a bunch of parameters and the jargon descriptions do not make much sense. Whatever their original intent, there is likely a different set of parameters that works best for JAV.
Most of these parameters deal with adjusting how it interprets the different probabilities associated with transcriptions or the absence of speech altogether.

task: 'translate' or 'transcribe'. Transcribe will output the source language (so Japanese-to-Japanese transcriptions), while Translate means the output will be the English translation.

no_speech_threshold: How certain it needs to be that a segment contains no speech before treating it as silence. The default is 0.6.

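logprob_threshold: A cutoff on the average log probability of the transcription for a segment; if the average falls below it, the decoding is treated as failed. The default is -1.0. Log probabilities are never above 0, so useful values are negative, and a threshold of 0 or higher would mark practically every segment as failed. I have not gotten a good intuition for this.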
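
To make the interplay between no_speech_threshold and logprob_threshold a bit more concrete, here is a simplified sketch of the skip decision as I read it in the reference implementation. The function name is made up for illustration, and the real code has more moving parts, so treat this as a mental model rather than the actual logic.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return True if the segment would be dropped as silence (simplified)."""
    # First vote: the model thinks this is probably not speech.
    skip = no_speech_prob > no_speech_threshold
    # Override: a confident transcription is kept even if the
    # no-speech probability is high.
    if avg_logprob > logprob_threshold:
        skip = False
    return skip


print(should_skip_segment(0.9, -1.5))   # likely silence, unsure text
print(should_skip_segment(0.9, -0.3))   # likely silence, but confident text
print(should_skip_segment(0.3, -1.5))   # probably speech, unsure text
print(should_skip_segment(0.3, -1.5, no_speech_threshold=0.08))
```

In this model a segment is only thrown away when both signals agree: the no-speech probability is above its threshold and the transcription confidence is below the logprob threshold. Lowering no_speech_threshold makes the first condition easier to satisfy.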

compression_ratio_threshold: 2.4 default. A measure of how repetitive the transcription is: text that is just the same line over and over compresses too well, and if the compression ratio is above this value the decoding is treated as failed. No idea how to intuit a good value.
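
One way to get a feel for the compression ratio is to compute it yourself. The sketch below uses zlib, which I believe is what Whisper uses for this check; the sample lines are invented for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw size to zlib-compressed size; higher means more repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


normal = "Good morning, Tanaka. Did you finish the report I asked for yesterday?"
looped = "I'm sorry. " * 30  # the kind of stuck output you sometimes see

for label, text in [("normal", normal), ("looped", looped)]:
    ratio = compression_ratio(text)
    flag = "over 2.4" if ratio > 2.4 else "under 2.4"
    print(f"{label}: {ratio:.2f} ({flag})")
```

Ordinary dialogue barely compresses, while a line repeated dozens of times compresses extremely well, which is what pushes it over the threshold.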

temperature: A measure of how much randomness goes into the transcription and translation process. This seems unintuitive at first, but trying a bunch of initial conditions, comparing all of the probabilities, and selecting the best produces better outcomes.

best_of: if Temperature is not 0, tells it how many times to attempt to translate/transcribe the segment.

beam_size: Default of 5, only used when Temperature is 0. Some measure of how broad of a search to do for the best solution, with larger values needing more VRAM?
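
A rough sketch of how temperature, best_of and beam_size fit together, assuming the command line flag --temperature_increment_on_fallback (it shows up in a command later in this thread) at 0.2. As I understand it, a segment is first decoded at the starting temperature and retried at higher temperatures only if it fails the compression ratio or logprob checks. The function names here are made up for illustration, and the defaults of 5 are my recollection of the CLI defaults.

```python
def temperature_schedule(start=0.0, increment=0.2, stop=1.0):
    """Temperatures tried in order when a segment keeps failing the checks."""
    steps = int((stop - start) / increment + 1e-6)
    return [round(start + i * increment, 2) for i in range(steps + 1)]


def decoding_mode(temperature, beam_size=5, best_of=5):
    """Which search strategy applies at a given temperature."""
    if temperature == 0:
        return f"beam search with beam_size={beam_size}"
    return f"sampling, keeping the best of {best_of} candidates"


for t in temperature_schedule():
    print(t, "->", decoding_mode(t))
```

So beam_size only matters for the first, deterministic attempt, and best_of only matters for the retries.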

condition_on_previous_text: Defaults to True, only other option is False at the time of this writing. This setting encourages the model to learn a specific topic, a specific style, or discern more consistent translations of proper nouns. I strongly recommend you use False when translating JAV. Since JAV can have many portions that are difficult to detect, transcribe, and translate (not enough JAV in their training data), having this set to True leads to some areas where the same translation is used line after line, creating such a strong bias towards some line that the translation may never recover. If you're being greedy, it may be worthwhile to translate a film twice, once with True and once with False, and then manually pick the best translation for each line. There are strengths to both.
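
If you want a quick way to spot the "same translation line after line" problem without reading the whole file, something like the sketch below can flag it. It takes the subtitle text lines in order; the function name and the cutoff of 3 repeats are arbitrary choices of mine.

```python
from itertools import groupby


def find_repeated_runs(lines, min_run=3):
    """Return (start_index, run_length, text) for runs of identical consecutive lines."""
    runs = []
    index = 0
    for text, group in groupby(line.strip() for line in lines):
        length = len(list(group))
        if text and length >= min_run:
            runs.append((index, length, text))
        index += length
    return runs


subtitle_lines = [
    "Hello.",
    "I'm sorry.",
    "I'm sorry.",
    "I'm sorry.",
    "I'm sorry.",
    "Let's go.",
]
for start, length, text in find_repeated_runs(subtitle_lines):
    print(f"line {start}: {text!r} repeated {length} times")
```

Running this on the output of a True run and a False run gives a rough idea of which one got stuck more often.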

initial_prompt: You can enter a sentence or series of words as a string to try to bias the translation in some way. It is not clear whether this is supposed to apply to only the first segment or the entire transcript, but in future versions it will probably be a bit more clear or adaptable.
suppress_tokens: There are some words or phrases or special characters that are ignored. There may be some way to either undo that or introduce new suppressed words, but I do not know how to use this correctly.

language: 'ja' for Japanese, not 'jp'. Whisper automatically detects the language, or is supposed to, but just be up front when using it on JAV.



Useful references:

Discussion on choosing parameters:

The original paper:

Setup Guide from user SamKook: This is from September so maybe the process has improved.

Anaconda - A Standard Python Install for Windows or Other OS:

Installing all of the command line and Python stuff - IMO do this in the Anaconda Terminal that you can open from inside the Anaconda Launcher.



FAQ (will expand)

Q: I have a problem with getting this or Python set up on my computer.
A: I can't help you, sorry. None of the steps are hard, but there can be compatibility issues that require either really knowing what you are doing or just restarting the entire installation process. Just make sure you are using CUDA if you want to leverage your GPU.

Q: Can I post some example Subtitles here for help?
A: Yes, I think that is a good use of this thread. Showing off the process and what gets better results is useful.
 
I have a frequent problem of the first 30 seconds not having translations or having bad translations, and then the chunk after the 30 second mark being rushed and horribly mistimed.
 
Can you post your hyperparameters? That would be helpful in debugging your issue. Here's what I have been using (input.wav is a placeholder for your own audio file):
Code:
whisper \
--model large-v2 \
--verbose False \
--task translate \
--language Japanese \
--temperature 0 \
--beam_size 5 \
--patience 1.0 \
--condition_on_previous_text False \
--fp16 True \
--temperature_increment_on_fallback 0.2 \
--compression_ratio_threshold 2.5 \
--logprob_threshold -1 \
--no_speech_threshold 0.08 \
input.wav
 
No Speech at 0.08 is very low. Do you get lots of phantom speech?
 
Yea, and using 0.08 still does not cure that issue. Might be because I am using large-v2, but I always see a fair amount of hallucination in my translations.
 
Damn, that probably is a result of the large-v2. I don't really understand the interplay between No Speech and Logprob. I think hallucination may end up solved via Logprob but I don't know what values to even guess at for it.
 
Wasn't aware that a fine tuned model is more difficult to run - will have to read up on it as I have been slowly gathering references on how to put a data set together with the intent of building something that can be used to fine tune.
 
As far as I can tell, you can't just run things with the simple command line after that. You need to start worrying about the different parts of the process. Not sure if that has changed or if it will change, but the documentation does not describe the process well.

There is logit biasing or additional training data as possible solutions, but I have not made much sense of it.
 
I think this thread will be a good place to debug subtitle files, but it would be a lot easier if we were allowed to upload .vtt and/or .srt files. I can't imagine there being a reason why not.

Otherwise, standard practice should be to rename them to .txt so they can be opened and viewed uncompressed.
 
Do I use --condition_on_previous_text False or --condition_on_previous_text=False or do both work?
 
If you are using the command line (which I assume is what you are doing based on the "--"), it is "--condition_on_previous_text False". The equals-sign form should also be accepted, since the command line is parsed by Python's argparse, but the space-separated form is what I use.
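
For anyone curious, here is a small stand-alone check of how Python's argparse treats the two spellings. As far as I can tell Whisper's command line is built on argparse, but this sketch does not import Whisper; the str2bool helper is my own stand-in for whatever conversion Whisper does internally.

```python
import argparse


def str2bool(value: str) -> bool:
    """Turn the strings 'True'/'False' into real booleans (stand-in helper)."""
    mapping = {"True": True, "False": False}
    if value not in mapping:
        raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")
    return mapping[value]


parser = argparse.ArgumentParser()
parser.add_argument("--condition_on_previous_text", type=str2bool, default=True)

with_space = parser.parse_args(["--condition_on_previous_text", "False"])
with_equals = parser.parse_args(["--condition_on_previous_text=False"])

print(with_space.condition_on_previous_text)   # False
print(with_equals.condition_on_previous_text)  # False
```

Both spellings parse to the same value in plain argparse, so either should be fine on the command line.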
 
Yes, I'm using the command line. Thanks.
 
False for "condition on previous text" lets Whisper start doing things like subtitling (crying) and (grunting), but I also get (speaking in foreign language). No shit, Whisper? That is what I am here for!

That is some of the value in running Whisper multiple times per project. One of them will pick up an actual translation instead of (speaking in foreign language).
 
Setting it to False did help with all the repeated lines. Haven't actually watched the movie yet, just skimmed through the subtitles so not sure yet how good or bad they are. Probably not great as my current computer can't handle the large model, medium is the highest.
 
My best luck so far has been no_speech_threshold at 0.6, logprob_threshold at -1.0, and condition_on_previous_text as False. Beam size something like 8 or 10.

This is a good general use for recent HD videos. I imagine that lower video quality could use other solutions.

From there, open it up in SubtitleEdit and identify the weird or bad lines and watch the video to see what they should be. SubtitleEdit feels like it was made for editing auto transcriptions due to how everything is laid out.
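
If you end up rerunning the same settings on many films, it can help to keep them in one place and generate the command line from them. A minimal sketch, assuming the Medium model, a beam size of 8, and a placeholder file name movie.wav; the flag names are the ones used elsewhere in this thread.

```python
import shlex


def build_whisper_command(audio_path, **options):
    """Turn a dict of settings into a whisper command line (as a list)."""
    command = ["whisper", audio_path]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


settings = {
    "model": "medium",            # assumed; use whatever your GPU can run
    "task": "translate",
    "language": "ja",
    "temperature": 0,
    "beam_size": 8,
    "condition_on_previous_text": False,
    "no_speech_threshold": 0.6,
    "logprob_threshold": -1.0,
}

command = build_whisper_command("movie.wav", **settings)
print(shlex.join(command))
```

The printed line can be pasted into the Anaconda terminal, or the list can be handed to subprocess.run if you want to batch several files.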
 
Just a little PSA. I've been using Faster Whisper through Faster Whisper Webui (https://huggingface.co/aadnk) for some days now, and I really recommend trying it out.

I have a 3080 and I've been using large-v2 as model, and with faster whisper I'm cutting from 20 to 60+ minutes on tasks, using way less GPU memory, and producing so much less heat :)

I can also push settings more now because I don't run out of VRAM.

Since it uses less VRAM, it also means that people who don't have 10 GB VRAM can use large-v2. RTX 2060 6GB seems to run it smoothly according to a comment on Faster Whisper Webui community tab

Faster Whisper (https://github.com/guillaumekln/faster-whisper):

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
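
For those who would rather script it than use the Webui, here is a rough sketch of calling faster-whisper from Python and writing an .srt yourself. The WhisperModel / transcribe calls follow the project's README as I remember it; the int8_float16 compute type, the file names, and the helper functions are my own assumptions for illustration. The import sits inside the function so the helpers can be used without the library installed.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 01:01:01,500."""
    total_ms = round(seconds * 1000)
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects that have .start, .end and .text."""
    blocks = []
    for number, segment in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n"
            f"{segment.text.strip()}\n"
        )
    return "\n".join(blocks)


def translate_to_srt(audio_path: str, srt_path: str) -> None:
    """Translate Japanese audio to English subtitles with faster-whisper."""
    from faster_whisper import WhisperModel  # needs: pip install faster-whisper

    # 8-bit quantization to save VRAM (assumed setting, adjust for your GPU).
    model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
    segments, _info = model.transcribe(
        audio_path,
        task="translate",
        language="ja",
        beam_size=5,
        condition_on_previous_text=False,
    )
    with open(srt_path, "w", encoding="utf-8") as handle:
        handle.write(segments_to_srt(segments))


print(format_timestamp(3661.5))
```

Calling translate_to_srt("movie.wav", "movie.srt") would then produce a file that opens directly in SubtitleEdit.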
 
Is there a way to run this via the command line as simply as regular Whisper? I have a ton of problems with trying to do things via regular Python code. I encounter way more VRAM issues, etc and it doesn't spit out all of the different subtitle formats automatically.


EDIT: Looks like that is what this is trying to do?

Looks like trying to get that running totally broke my ability to run regular Whisper, so I guess I'm out of the game for the foreseeable future. Looks like library pathing is a disaster for whisper-ctranslate2, so you need to jump through a bunch of hoops. Not even remotely as simple as regular command line Whisper. This seems to be something that pretty much everybody is running into, so I imagine it will get fixed.
 
Is 8 GB enough to run large with this version? The normal whisper won't go above medium with 8 GB. I don't really care that much about the higher speed, but being able to use a bigger language model would be great.