Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

Is there a way to run this via the command line as simply as regular Whisper? I have a ton of problems with trying to do things via regular Python code. I encounter way more VRAM issues, etc and it doesn't spit out all of the different subtitle formats automatically.


EDIT: Looks like that is what this is trying to do?

Trying to get that running totally broke my ability to run regular Whisper, so I guess I'm out of the game for the foreseeable future. Library pathing is a disaster for whisper-ctranslate2, so you need to jump through a bunch of hoops. Not even remotely as simple as regular command-line Whisper. This seems to be something pretty much everybody is hitting, so I imagine it will get fixed.
That sucks. Was about to suggest using Anaconda for separate environments to avoid things like this, but I see it's already mentioned in your OP.

I linked to https://github.com/guillaumekln/faster-whisper in my post, but as mentioned, I run it through Faster Whisper Webui. I haven't tested it, but you can run CLI through Whisper Webui/Faster Whisper Webui - "You can also run the CLI interface, which is similar to Whisper's own CLI but also supports the following additional arguments" - https://huggingface.co/spaces/aadnk/faster-whisper-webui/blob/main/README.md

It also has an API that can be used.

Is 8 GB enough to run large with this version? The normal whisper won't go above medium with 8 GB. I don't really care that much about the higher speed, but being able to use a bigger language model would be great.
I would think so. As mentioned, there's been at least one report of a user with 6 GB VRAM that can run large-v2. Depending on settings, I see that my VRAM usage ranges from about 3.5 GB to 5.1 GB.
 
OpenAI made massive waves in 2022 when they released Dall-E 2 and ChatGPT, but that year they also released Whisper for Automatic Speech Recognition, meaning transcripts. Not just transcripts, but direct-to-English transcripts from any of 97 different languages, including Japanese. In practice, these are a high watermark for auto transcriptions, and will only get better with time like most of these AI techniques.

What this means is that you (yes, YOU) can, without collaborators, generate and edit English subtitles for your favorite JAV.

Some kind souls have put up free portals, either on Google Colab or hosted on their own machines to share GPU cycles they aren't using. Otherwise, you will need to use Python or the command line.

I'll update this post and the FAQ based on feedback and discussion. Better to have an early version of this out than to sit on it until I never perfect it.

How it Works

Like the other recent AI models, this is a giant Neural Network (NN) trained on a huge amount of diverse data. There are a bunch of moving parts to this, but here is what sets it apart from other transcription tools:
  1. The model is trained on 680,000 hours of audio across 97 different languages, split into 30-second chunks.
  2. Translates directly from the source language to English subtitles, leading to some better interpretations.
  3. Easy to set up in Python for both CPU and GPU. Get Anaconda, and the command line from inside the Anaconda Launcher is going to be sufficient for people with even limited programming experience.
  4. Various online portals of people loaning out their spare GPU time so you can do Medium or Large models.
  5. Small produces reasonable results off a CPU, and the Medium model runs off semi-recent GPUs. The Large model is too large for even avid gamers' GPUs, and there is no way to cheat the hard VRAM requirements.
  6. Built-in Voice Detection systems and timing. The Google Colab that people use implements a more consistent Voice Detection system, which likely also leads to better subtitle timings.
  7. The basic flow of Whisper is that it attempts to identify single speaking segments of a single speaker, then determines the most likely transcription of that dialogue chunk. It assigns probabilities to a whole bunch of different candidates and chooses the one that seems most likely. Among those probabilities is a prediction for "this segment isn't actually speech", and many of the parameters you set control what the model does with these probabilities.
  8. While you can technically throw a full film as the target file to transcribe/translate, it is better to make a separate file that is just the 100MB-200MB audio file. Use MKVToolNix or FFMPEG.
  9. Can be run via Python script, notebook, or command line. I prefer command line. Notebooks will have problems with freeing up the VRAM.
  10. Japanese is one of the better languages for this by standard transcription and translation metrics, but dirtiness and pronouns will be a frequent problem to work around.
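
Point 8 above, extracting just the audio track, can be scripted. Here is a minimal sketch using Python's subprocess module to drive FFMPEG; the specific output settings (16 kHz mono WAV) are my own choice for keeping the file small, not a requirement:

```python
import subprocess

def build_extract_cmd(video_path: str, audio_path: str) -> list:
    """Assemble an ffmpeg command that drops the video stream and keeps
    a small mono audio file. Whisper resamples internally, so the exact
    rate is not critical; 16 kHz matches its native input."""
    return [
        "ffmpeg", "-y",       # overwrite the output file without asking
        "-i", video_path,     # input movie
        "-vn",                # no video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        audio_path,
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    """Run the extraction; requires ffmpeg on your PATH."""
    subprocess.run(build_extract_cmd(video_path, audio_path), check=True)
```

MKVToolNix can do the same job through a GUI if you would rather not touch the command line.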


Strengths of Whisper

⦁ Makes it feasible for one person to generate subtitle files for what used to be a multi-person job. One person, with Whisper, can produce timed English subtitle files from any language.
⦁ Does a remarkably good job at discerning Japanese names and plot details. For something like Attackers movies, this is one of the things that people are most interested in, and this ends up being comparable to or better than most of the subs you'll find online.
⦁ The subs can be a lot more complete. Most JAV subs just don't bother translating a lot of the lesser text, but if you calibrate the detection thresholds, you will have far better coverage of speech than you're used to seeing.
⦁ Easily outclasses the English subtitles based on Chinese hardcoded subtitles that have been flooding the web for the last two years.

Weaknesses of Whisper

⦁ While there are undoubtedly going to be many sites in the future that use Whisper transcriptions without editing, a single editing pass is going to greatly improve the quality and readability. It is easy to catch badly interpreted lines when doing a test run.
⦁ Whisper is bad at "talking dirty". There are likely ways to fix this, but it isn't hard to see why the training data sets might veer away from filthy language.
⦁ For Japanese-to-English transcriptions, the models that can run on a CPU really don't cut it. You need a semi-recent GPU to be able to run the Medium model, and the Large model that produces by far the best results is just totally out of the price range for nearly all casual users. As such, most of my advice in this thread is going to be for maximizing the quality of the Medium models.
⦁ Japanese is a genuinely difficult language in which to get pronouns right. This is something that can mostly be corrected during the editing watch.
⦁ For some of the tweaking parameters it can be difficult to intuit what a good value is, and there may be substantially different parameters between what is good for a new movie and what is good for something from a decade ago.



Parameter Tuning

Whisper has a bunch of parameters, and the jargon-heavy descriptions do not make much sense at first. Whatever the defaults were originally tuned for, a different set of values is likely better for JAV. Most of these parameters deal with adjusting how the model interprets the probabilities associated with transcriptions, or with the absence of speech altogether.

task: 'translate' or 'transcribe'. Transcribe will output in the source language (so Japanese-to-Japanese transcription), while translate means the output will be the English translation.
no_speech_threshold: How confident the model must be that a segment contains no speech before dropping it. Defaults to 0.6; if the no-speech probability exceeds this value (and the average log probability also falls below logprob_threshold), the segment is treated as silence.

logprob_threshold: The average log probability of the decoded tokens, used to detect failed decodes. The default is -1.0. Since these are log probabilities, values sit at or below 0; raising the threshold toward 0 makes the filter stricter about what counts as a confident transcription.

compression_ratio_threshold: 2.4 by default. The gzip compression ratio of the decoded text. Highly repetitive output (the same line over and over) compresses very well, so a segment whose text compresses better than this ratio is treated as a failed decode.

temperature: A measure of how much randomness goes into the transcription and translation process. This seems unintuitive at first, but sampling several candidates under different random conditions, comparing their probabilities, and selecting the best produces better outcomes. Note that Whisper's default is actually a fallback schedule: it starts at 0 and raises the temperature in steps when a decode fails the thresholds above.

best_of: If temperature is not 0, how many candidate samples to draw for each segment, keeping the best.

beam_size: Default of 5, only used when temperature is 0. The width of the beam search, i.e. how many candidate token sequences are kept at each step; larger values search more broadly and need more VRAM.
condition_on_previous_text: Defaults to True; the only other option is False at the time of this writing. This setting encourages the model to learn a specific topic or style, or discern more consistent translations of proper nouns. I strongly recommend you use False when translating JAV. Since JAV can have many portions that are difficult to detect, transcribe, and translate (not enough JAV in the training data), having this set to True leads to some areas where the same translation is used line after line, creating such a strong bias toward one line that the translation may never recover. If you're being greedy, it may be worthwhile to translate a film twice, once with True and once with False, and then manually pick the best translation for each line. There are strengths to both.

initial_prompt: You can enter a sentence or series of words as a string to try to bias the translation in some way. It is not clear whether this is supposed to apply to only the first segment or the entire transcript, but in future versions it will probably be a bit more clear or adaptable.
suppress_tokens: Some words, phrases, and special characters are suppressed by default. There may be a way to either undo that or add newly suppressed tokens, but I do not know how to use this correctly.

language: 'ja' for Japanese, not 'jp'. Whisper is supposed to auto-detect the language, but just specify it up front when using it on JAV.
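
Pulled together, the settings above map onto openai-whisper's transcribe() options roughly like this. The values are a starting point distilled from this thread's advice, not canonical recommendations:

```python
# Settings distilled from the discussion above; treat them as a starting
# point, not gospel. The keys match openai-whisper's transcribe() options.
JAV_SETTINGS = dict(
    task="translate",                  # Japanese audio -> English text
    language="ja",                     # skip auto-detect; we know the source
    temperature=0.0,                   # deterministic decoding with beam search
    beam_size=5,                       # library default
    condition_on_previous_text=False,  # avoid runaway repeated lines
    no_speech_threshold=0.6,           # library default
    logprob_threshold=-1.0,            # library default
    compression_ratio_threshold=2.4,   # library default
)

# Typical use (requires `pip install openai-whisper` and a decent GPU):
# import whisper
# model = whisper.load_model("medium")
# result = model.transcribe("movie_audio.wav", **JAV_SETTINGS)
```

The same names also exist as command-line flags (e.g. --condition_on_previous_text False) if you prefer the CLI.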



Useful references:

Discussion on choosing parameters:

The original paper:

Setup Guide from user SamKook: This is from September, so the process may have improved since.

Anaconda - A Standard Python Install for Windows or Other OS:

Installing all of the command line and Python stuff - IMO do this in the Anaconda Terminal that you can open from inside the Anaconda Launcher.



FAQ (will expand)

Q: I have a problem with getting this or Python set up on my computer.
A: I can't help you, sorry. None of the steps are hard, but there can be compatibility issues that require either really knowing what you are doing or just restarting the entire installation process. Just make sure you are using CUDA if you want to leverage your GPU.

Q: Can I post some example Subtitles here for help?
A: Yes, I think that is a good use of this thread. Showing off the process and what gets better results is useful.
Hello again,
Thanks for posting this, it looks great. No time to have a proper look now, unfortunately, but I will over the next few days. I have many plot points I want to try and subtitle!
 
I find the commandline for regular Whisper way easier to use, and I want to avoid having to upload files due to how many subs I do.

I will likely take a stab at using the WebUI at some point, or just wait for ctranslate2 to get fixed; there is no reason for it not to just use whatever paths you have set up.

I need to be more careful with Anaconda and use it properly. Using that on a Windows machine with the windows commandline rather than Unix is something I am not used to but am now doing on my GPU machine.
 
I have been using this model thus far, and it works great for real conversations. However, during intimate moments, the model tends to miss transcribing some small lines.

Another issue is that Google Colab limits the amount of computing power you can use within a certain time frame. I am unsure how to utilize the Colab runtime with my local GPU.
 
whisper-ctranslate2 does the Large-v2 model when I could only use medium before. Lots of downsides though.

  • I spent a huge amount of time dealing with Cuda paths. I eventually had to just copy-paste all of the Cuda dll files from my torch site-packages to the ctranslate folder, which is bad practice and could lead to errors.
  • Large v2, or maybe this, has far more substantial phantom speech and repetition errors than original Whisper with medium.
  • The default settings are totally different, including that it no longer saves subtitles.
It at least has VAD support. I have no clue how or why they decided to not just use the same defaults or package paths or even directory and output paths of Whisper. This is wildly harder to use for only slight benefit.
 
I think it's a mixture of the model and the VAD, but mostly the model. Technically the VAD should actually help, though. I got around the same amount of hallucinations with VAD and large-v2 with "regular" Whisper as well. Hallucination is a somewhat frequent subject in the Whisper GitHub, and also mentioned in https://github.com/openai/whisper/blob/main/model-card.md.

On the "positive" side, if you transcribe instead of translate, I feel like there's less randomness to the hallucinations, making them easier to clean before doing a complete translation with DeepL. A random English word might pop up, but that's very easy to spot compared to spotting it in an English translation. And when it comes to full sentences, they're usually different versions of "thanks for watching, like and subscribe" :D

Code:
    content = content.replace("本日はご覧いただきありがとうございます。良い一日を!", "_replace_") #Thank you for watching today.
    content = content.replace("本日はご覧いただきありがとうございます。良い1日を!", "_replace_") #Thank you for watching today.
    content = content.replace("本日もご視聴いただきありがとうございます。 良い一日を!", "_replace_") #Thank you for watching today. have a nice day!
    content = content.replace("本日もご視聴いただきありがとうございます。", "_replace_") #Thanks for watching my video.
    content = content.replace("本日はご視聴ありがとうございました。", "_replace_") #Thank you for watching
    content = content.replace("本日もご視聴ありがとうございました", "_replace_") #Thanks for watching my video.
    content = content.replace("今日のビデオはここまでです。私のビデオを見てくれてありがとう、私はあなたを愛して", "_replace_") #Thanks for watching
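
The replace chain above can be folded into a small helper so that each newly spotted hallucination is a one-line addition to a list. A sketch (the phrase list here is just a subset of the ones shown):

```python
# Known hallucinated boilerplate phrases; extend as you spot new ones.
HALLUCINATED_PHRASES = [
    "本日はご覧いただきありがとうございます。良い一日を!",  # "Thank you for watching today. Have a nice day!"
    "本日もご視聴いただきありがとうございます。",            # "Thanks for watching my video."
    "本日はご視聴ありがとうございました。",                  # "Thank you for watching."
]

def scrub_hallucinations(content: str, marker: str = "_replace_") -> str:
    """Replace every known 'thanks for watching' phrase with a marker
    that is easy to grep for and delete during the editing pass."""
    for phrase in HALLUCINATED_PHRASES:
        content = content.replace(phrase, marker)
    return content
```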
 
Using this package will get you a much cleaner command-line solution that can optionally call into faster-whisper.


Once you clone the git repo, you can be up and running with two commands:

Code:
pip install -r requirements.txt
pip install -r requirements-fasterWhisper.txt
 

CUDA, in particular, seems to have a whole lot of potential problems when using mishmashes of DLLs, which is likely to happen when you have to start using workarounds to deal with different pathing standards. All of these upgraded versions of Whisper easily have ten times the expected setup time compared to getting the base version working without error. I did a full reinstall of CUDA, but still have a bunch of path issues on Windows.
 
SubtitleEdit has built-in support for generating Whisper subtitles, so you do not even need to do your own Python setups.
 
The good news is that whisper-ctranslate2 works on the commandline for Faster Whisper now, as long as you aren't using temperature settings. It even includes VAD settings.

I can't find good settings that work. I am trying to use length_penalty to prevent the long lines, or repetitive lines, but I have not been able to find good settings that get the strengths of the Large models without the big downsides of repetitive lines or lines that are just several minutes of "Oh oh oh oh oh oh".
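
While hunting for decoder settings, runaway lines like that can at least be flagged in post-processing. A rough heuristic I'd suggest, with a made-up ratio threshold you would tune on your own output:

```python
def is_runaway_line(text: str, max_repeat_ratio: float = 0.6) -> bool:
    """Heuristic: flag subtitle lines like 'Oh oh oh oh oh oh' where a
    single token dominates. The 0.6 threshold is a guess; tune it on
    your own transcripts."""
    tokens = text.lower().split()
    if len(tokens) < 4:
        return False  # too short to judge
    most_common = max(tokens.count(t) for t in set(tokens))
    return most_common / len(tokens) > max_repeat_ratio
```

Flagged lines can then be shortened or dropped during the editing pass rather than fixed blindly.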
 
Not sure if there's a more general AI translation thread but Google has demonstrated full video to video translation that translates to speech in the target language and changes the mouth movements to match.

Google Universal Translator
 
After a lot of setup, I managed to get this working last night with small files. The results are pretty good, but I don't speak any Japanese, so they could be off. So many issues with CUDA and the pathing, plus extra DLLs like zlib being needed, which isn't explained in the GitHub repos.

After getting it working (not through CMD, as it doesn't save an SRT?), I decided to have a crack at the small files using CTranslate2 (medium) within Subtitle Edit. It worked well, so for the big test I did a 2-hour video, again using Subtitle Edit and CTranslate2, but this time with the large-v2 model. This looks like it failed, because after 8 hours it's still transcribing audio. I could leave it going, but it's either bugged out or just too much for my 3070. I'm now trying again on the two-hour video with the medium model, and it's working and will probably take 20-30 minutes in total. So it's a shame the large model didn't.

EDIT:
Tried large again and it's working now; it didn't bug out. Took about 30 minutes. At first glance, the quality difference between medium and large wasn't a huge leap.
 
I agree that large is not vastly better than medium, but should be used if possible.

I am still narrowing down the various thresholds and settings that produce good results.

I think a Large model with temperature of something like 0.1 or 0.2 and a high "best_of" (10? 20?) seems like it produces very good results.

My problem now is getting the VAD to work. VAD uses Silero in whisper-ctranslate2, but I do not have a good idea of what settings to use. A lower VAD threshold seems like it picks up less dialogue, which is not what I would intuitively think.

A vad_threshold of 0.8 or so seems good for most situations.
 
Large-v2 with the default VAD settings looks like it gets good results. Specifying a vad_threshold manually seems to lead to worse results.

Not sure what the default for vad_threshold is, or if it forces a more rigid view rather than something more dynamic that picks up on different volume levels.
 
I picked up bits and pieces and have put together this implementation based on faster-whisper.
I do quite a bit of post-processing to remove the hallucinations, repetition, and long timestamps.
Give it a try. I want to improve it more, so any comments, ideas, and suggestions are very welcome.

WhisperJav

 

Great job!
What have you done to remove the hallucination?
 

The usual: hyper-parameters and then post-processing. You can see the values of the hyper-parameters in the code. I think the most effective one is temperature=0. The post-processing treats Japanese and English differently. I wrote them as functions, so if anyone wants to reuse them, that is quite easy.
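
For anyone writing their own cleanup functions, the repetition half of that post-processing can be as simple as collapsing consecutive duplicate subtitle texts. This is a generic sketch, not WhisperJav's actual code:

```python
def collapse_consecutive_duplicates(lines: list) -> list:
    """Drop subtitle texts that exactly repeat the previous one: the
    classic Whisper repetition artifact. Cue timestamps of the survivors
    are left untouched; merging cue times would be a separate step."""
    cleaned = []
    for line in lines:
        if not cleaned or line != cleaned[-1]:
            cleaned.append(line)
    return cleaned
```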
 
Finally got around to installing whisper on my new computer. Did a test run on a 2 hour video. It took almost half an hour to run and left me with some questions. I haven't actually watched the movie with the subs yet, so I don't know how good or bad the translations actually are yet, so my questions are about the technical parts.

I used model large-v2, is this indeed the best one?

Should I use any other settings beside temperature 0 and condition_on_previous_text false?

I ran the program straight from the mp4. Would it actually run faster if I extracted the audio file and used that instead?

I noticed a few instances where the same line was repeated over and over. What settings would stop that?
 
Here is my personal experience: to create decent subs, there are at least three necessary steps:

(a)- preprocess the audio: at the minimum preprocess using a VAD (voice activity detection) or equivalent;
(b)- the right settings for Whisper
(c)- post-process the results for removal of rubbish.

Among the above three steps, the first is the MOST impactful. I don't think one can get a decent sub without that step, full stop. There are packages that combine VAD with Whisper, e.g. faster-whisper. There are also implementations that combine all three steps, e.g. "Whisper with VAD", which is the go-to package in this forum.
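
The three steps read as a simple pipeline. A skeleton with placeholder functions (every body here is a stand-in; wire in a real VAD such as Silero, your actual Whisper call, and your own cleanup):

```python
def detect_speech(path: str) -> list:
    """(a) Placeholder VAD: in practice, run Silero or similar and
    return (start, end) spans of detected speech in seconds."""
    return [(0.0, 3.2), (5.1, 9.8)]

def transcribe(spans: list) -> list:
    """(b) Placeholder for the Whisper call on each voiced span."""
    return [f"line for {start:.1f}-{end:.1f}" for start, end in spans]

def postprocess(lines: list) -> list:
    """(c) Placeholder cleanup: drop exact consecutive repeats."""
    out = []
    for line in lines:
        if not out or line != out[-1]:
            out.append(line)
    return out

def make_subs(video_path: str) -> list:
    """Chain the three steps: VAD, transcription, post-processing."""
    return postprocess(transcribe(detect_speech(video_path)))
```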

That is my experience. Would love to hear from others of course.

PS. what is your GPU?
 