OpenAI made massive waves in 2022 when they released DALL-E 2 and ChatGPT, but that year they also released Whisper for Automatic Speech Recognition (ASR), meaning transcripts. Not just transcripts, but direct-to-English transcripts from any of 97 different languages, including Japanese. In practice, these are a high-water mark for auto transcriptions, and they will only get better with time, like most of these AI techniques.
What this means is that you (yes, YOU) can, without collaborators, generate and edit English subtitles for your favorite JAV.
Some kind souls have put up free portals, either on Google Colab or running on their own spare GPU cycles. Otherwise, you will need to use Python or the command line.
I'll update this post and the FAQ based on feedback and discussion. Better to have an early version of this out than to sit on it until it is perfect, which would be never.
How it Works
Like the other recent AI models, this is a giant Neural Network (NN) trained on a huge amount of diverse data. There's a bunch of moving parts to this, but here is what sets it apart from other transcription tools:
- The model is trained on 680,000 hours of audio across 97 different languages, split up into 30-second chunks.
- Translates directly from the source language to English subtitles, leading to some better interpretations.
- Easy to set up in Python for either CPU or GPU. Get Anaconda; the command line from inside the Anaconda Launcher is going to be sufficient for people with even limited programming experience.
- Various online portals exist where people loan out their spare GPU time so you can run the Medium or Large models.
- Small produces reasonable results off a CPU, and the Medium model runs off of semi-recent GPUs. The Large model is too large for even avid gamers to run, and there is no way to cheat the hard VRAM requirements.
- Built-in Voice Detection systems and timing. The Google Colab that people use implements a more consistent Voice Detection system, which likely also leads to better subtitle timings.
- The basic flow of Whisper is that it attempts to identify single speaking segments of a single speaker, and attempts to determine the most likely transcription of that dialogue chunk. It assigns probabilities to a whole bunch of different possibilities, and then chooses the one that seems to be the most likely. Among those probabilities is a prediction for "this segment isn't actually speech", and many of the parameters you choose for the model decide what it does with these probabilities.
- While you can technically use a full film as the target file to transcribe/translate, it is better to make a separate file containing just the audio, which will be around 100MB-200MB. Use MKVToolNix or FFmpeg.
- Can be run via Python script, notebook, or command line. I prefer command line. Notebooks will have problems with freeing up the VRAM.
- Japanese is one of the better languages for this by standard transcription and translation metrics, but dirty talk and pronouns will be frequent problems to work around.
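For the audio-extraction step in the list above, here is a minimal sketch that calls FFmpeg from Python. The helper names `build_ffmpeg_audio_cmd` and `extract_audio` are made up for this example, and the choice of a mono 16 kHz WAV as the output is an assumption on my part (it keeps the file small); any audio-only file should work. It assumes `ffmpeg` is installed and on your PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_audio_cmd(video_path: str, audio_path: str) -> List[str]:
    """Build an FFmpeg command that keeps only the audio track."""
    return [
        "ffmpeg",
        "-i", video_path,  # input video file
        "-vn",             # drop the video stream
        "-ac", "1",        # downmix to mono (assumption, keeps the file small)
        "-ar", "16000",    # resample to 16 kHz (assumption, keeps the file small)
        audio_path,        # output audio file
    ]


def extract_audio(video_path: str) -> Path:
    """Write <name>.wav next to the video and return its path."""
    audio_path = Path(video_path).with_suffix(".wav")
    # check=True raises an error if FFmpeg exits with a failure code
    subprocess.run(build_ffmpeg_audio_cmd(video_path, str(audio_path)), check=True)
    return audio_path
```

Calling `extract_audio("movie.mkv")` would leave a `movie.wav` in the same folder, which you can then hand to Whisper instead of the full video.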
Strengths of Whisper
⦁ Makes it feasible for one person to do what used to be a multi-person job. One person, with Whisper, can produce timed subtitle files in English from any language.
⦁ Does a remarkably good job at discerning Japanese names and plot details. For something like Attackers movies, this is one of the things that people are most interested in, and this ends up being comparable to or better than most of the subs you'll find online.
⦁ The subs can be a lot more complete. Most JAV subs just don't bother translating a lot of the lesser text, but if you're calibrating the detection thresholds, you will have far better coverage of speech than you're used to seeing.
⦁ Easily outclasses the English subtitles based on Chinese hardcoded subtitles that have been flooding the web for the last two years.
Weaknesses of Whisper
⦁ While there are undoubtedly going to be many sites in the future that use Whisper transcriptions without editing, a single editing pass is going to greatly improve the quality and readability. It is easy to catch badly interpreted lines when doing a test run.
⦁ Whisper is bad at "talking dirty". There are likely ways to fix this, but it isn't hard to see why the training data sets might veer away from filthy language.
⦁ For Japanese-to-English transcriptions, the models that can run on a CPU really don't cut it. You need a semi-recent GPU to be able to run the Medium model, and the Large model that produces by far the best results is just totally out of the price range for nearly all casual users. As such, most of my advice in this thread is going to be for maximizing the quality of the Medium models.
⦁ Pronouns are genuinely difficult to get right when translating from Japanese. This is something that can mostly be corrected during the editing watch.
⦁ For some of the tuning parameters it is difficult to intuit what a good value is, and the best values may differ substantially between a new movie and something from a decade ago.
Parameter Tuning
Whisper has a bunch of parameters and the jargon descriptions do not make much sense. Whatever their original intent is, there is likely a different set of parameters that works better here than the defaults.
Most of these parameters deal with adjusting how it interprets the different probabilities associated with transcriptions or with the absence of speech altogether.
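As a rough illustration of how thresholds like these act on the probabilities, here is a simplified sketch. It is not Whisper's actual code: the function names `needs_retry` and `is_silence` are made up for this example, the 0.6 default for no_speech_threshold is taken from Whisper's defaults, and the real decoding loop has more going on. The idea is that a decoded segment is retried if it is too repetitive or too low-confidence, and skipped if it looks like silence.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def needs_retry(text: str, avg_logprob: float,
                compression_ratio_threshold: float = 2.4,
                logprob_threshold: float = -1.0) -> bool:
    """True if a decoded segment looks too repetitive or too low-confidence."""
    too_repetitive = compression_ratio(text) > compression_ratio_threshold
    too_uncertain = avg_logprob < logprob_threshold
    return too_repetitive or too_uncertain


def is_silence(no_speech_prob: float, avg_logprob: float,
               no_speech_threshold: float = 0.6,
               logprob_threshold: float = -1.0) -> bool:
    """True if the segment should be skipped as containing no speech."""
    # Skip only when the model thinks it is silence AND the text it
    # produced anyway is low-confidence.
    return no_speech_prob > no_speech_threshold and avg_logprob <= logprob_threshold


line = "The quick brown fox jumps over the lazy dog."
print(needs_retry(line, avg_logprob=-0.3))         # varied, confident text
print(needs_retry("yes " * 100, avg_logprob=-0.3)) # same word over and over
print(is_silence(no_speech_prob=0.9, avg_logprob=-1.5))
```

Under this reading, raising no_speech_threshold makes it harder for a segment to be thrown out as silence, and lowering logprob_threshold makes the model more willing to keep low-confidence lines.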
task: 'translate' or 'transcribe'. Transcribe will output the source language (so Japanese-to-Japanese transcriptions), while Translate means the output will be the English translation.
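no_speech_threshold: How certain it needs to be that a segment contains no speech before it treats the segment as silence and skips it. The default is 0.6.
logprob_threshold: A threshold on the average log-probability of the decoded text. If the average falls below this value, the attempt is treated as failed. The default is -1.0. Log-probabilities are never above 0, so the useful range is negative values, and anything above 0 would mark every attempt as failed. I have not gotten a good intuition for tuning this.
compression_ratio_threshold: 2.4 default. A measure of how well the transcription text compresses. If the ratio is above this value, the attempt is treated as failed, which catches output that is just the same line over and over. I do not have a good intuition for tuning this one either.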
temperature: A measure of how much randomness goes into the transcription and translation process. This seems unintuitive at first, but trying a bunch of initial conditions, comparing the probabilities of what comes out, and selecting the best produces better outcomes.
best_of: If temperature is not 0, tells it how many times to attempt to translate/transcribe the segment.
beam_size: Default of 5, only used when Temperature is 0. Some measure of how broad of a search to do for the best solution, with larger values needing more VRAM?
condition_on_previous_text: Defaults to True, only other option is False at the time of this writing. This setting encourages the model to learn a specific topic, a specific style, or discern more consistent translations of proper nouns. I strongly recommend you use False when translating JAV. Since JAV can have many portions that are difficult to detect, transcribe, and translate (not enough JAV in their training data), having this set to True leads to some areas where the same translation is used line after line, creating such a strong bias towards some line that the translation may never recover. If you're being greedy, it may be worthwhile to translate a film twice, once with True and once with False, and then manually pick the best translations for each line. There are strengths to both.
initial_prompt: You can enter a sentence or series of words as a string to try to bias the translation in some way. It is not clear whether this is supposed to apply to only the first segment or the entire transcript, but in future versions it will probably be a bit more clear or adaptable.
suppress_tokens: There are some words or phrases or special characters that are ignored. There may be some way to either undo that or introduce new suppressed words, but I do not know how to use this correctly.
language: 'ja' for Japanese, not 'jp'. Whisper is supposed to detect the language automatically, but just be up front and set it when using it on JAV.
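Putting the parameters above together in Python might look like the sketch below. The helper names `build_transcribe_options` and `transcribe_file` are hypothetical, and the values shown are Whisper's defaults except for language='ja' and condition_on_previous_text=False, which follow the recommendations above. It assumes the openai-whisper package is installed; treat it as a starting point to tweak, not as tuned settings.

```python
def build_transcribe_options(translate: bool = True) -> dict:
    """Collect the tuning parameters discussed above into one dict."""
    return {
        "task": "translate" if translate else "transcribe",
        "language": "ja",                     # 'ja', not 'jp'
        "condition_on_previous_text": False,  # recommended above for JAV
        "no_speech_threshold": 0.6,
        "logprob_threshold": -1.0,
        "compression_ratio_threshold": 2.4,
        # Start at 0 and fall back to higher temperatures on failed attempts
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        "beam_size": 5,  # used when temperature is 0
        "best_of": 5,    # used when temperature is above 0
    }


def transcribe_file(audio_path: str, model_name: str = "medium") -> dict:
    """Run Whisper on an audio file and return its result dict."""
    import whisper  # imported here so the options can be built without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, **build_transcribe_options())
```

The returned dict has a "segments" list, and each segment carries "start", "end", and "text", which is what you need to write out a timed subtitle file.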
Useful references:
Discussion on choosing parameters: Deepgram's blog post exploring some of the options in OpenAI Whisper's inference and how they impact results (blog.deepgram.com)
The original paper:
Setup Guide from user SamKook: This is from September so maybe the process has improved.
Anaconda - A Standard Python Install for Windows or Other OS: anaconda.cloud
Installing all of the command line and Python stuff - IMO do this in the Anaconda Terminal that you can open from inside the Anaconda Launcher.
AssemblyAI's guide to running Whisper, with a performance analysis: www.assemblyai.com
FAQ (will expand)
Q: I have a problem with getting this or Python set up on my computer.
A: I can't help you, sorry. None of the steps are hard, but there can be compatibility issues that require either really knowing what you are doing or just restarting the entire installation process. Just make sure you are using CUDA if you want to leverage your GPU.
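One quick way to check the CUDA part is to ask PyTorch (which Whisper runs on) whether it can see your GPU. A minimal sketch, where `pick_device` is a made-up helper name:

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed at all
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints "cpu" on a machine with an NVIDIA card, the usual suspect is a CPU-only PyTorch install rather than anything in Whisper itself.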
Q: Can I post some example Subtitles here for help?
A: Yes, I think that is a good use of this thread. Showing off the process and what gets better results is useful.