Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

montblanc5

New Member
Aug 25, 2024
7
5
Thanks for putting in the work with this. It's been really easy to use and fast to generate for multiple files. Obviously it's not perfect for everything, but it's been pretty good for the interview portion of "amateur" woman off the street vids.
 

Goldenpope

New Member
Jun 26, 2018
5
0
Guys, I'm having this problem. Does anyone know the reason?


RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

I'm also getting another error while trying to connect to a GPU. Is that related to membership? I thought it was free.
Unable to connect to GPU backend
You cannot connect to the GPU at this time due to usage limits in Colab. Get information
For more GPU access, you can purchase a Colab processing unit with Pay As You Go.
I am having the same issue, did you ever find a way to resolve the CUDA runtime error?
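
For what it's worth, the two errors are probably related: if Colab refuses the GPU backend, the notebook runs on a CPU-only machine, and CUDA libraries tend to complain about the driver there. A minimal sketch of a check to run before launching anything, assuming that a working `nvidia-smi` on the PATH is a good enough proxy for a usable GPU (the helper names are made up for illustration):

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Return True if nvidia-smi exists and exits cleanly (assumed proxy for a usable GPU)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_device() -> str:
    """Choose the value to pass to Whisper's --device option."""
    return "cuda" if has_nvidia_gpu() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

If this prints `cpu` on Colab, the runtime has no GPU attached, and the CUDA error is expected until a GPU runtime becomes available again.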
 

Elmerbs

New Member
Feb 20, 2024
2
4
My grain of salt for programmers out there.
I cloned the regular Whisper and created a Python 3.9 environment with all the dependencies so as not to mess with my global Python.
I scrape movies into folders as part of my download routine.
I made a batch script in LINQPad that goes over the folders that don't contain subs.
From each folder, I extract the audio from the movie with ffmpeg.NET and launch a new Python process with:
Arguments = $"-m whisper \"{inputFilePath}\" --model small --language Japanese --fp16 False --condition_on_previous_text False --temperature 0.2 --beam_size 10 --best_of 10 --patience 2 --output_dir \"{outputDir}\" --output_format srt"
I can't say it's fast, but you'd be surprised how well the small model performs. I only have a 6GB NVIDIA card, so I can't do much more with the regular Whisper anyway. The batch is pretty straightforward and runs in the background, so it doesn't bother me.
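
For anyone who would rather stay in Python than LINQPad, the same batch idea can be sketched as below. This is only an illustration under assumptions: the list of video extensions and the helper names are mine, and the Whisper options are simply copied from the command line above.

```python
import sys
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".wmv"}  # assumed set of video extensions


def folders_without_subs(root: Path) -> list[Path]:
    """Return the sub-folders of root that hold a video but no .srt file."""
    todo = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        has_video = any(f.suffix.lower() in VIDEO_EXTS for f in files)
        has_subs = any(f.suffix.lower() == ".srt" for f in files)
        if has_video and not has_subs:
            todo.append(folder)
    return todo


def whisper_args(audio: Path, output_dir: Path) -> list[str]:
    """Build the command from the post as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "-m", "whisper", str(audio),
        "--model", "small",
        "--language", "Japanese",
        "--fp16", "False",
        "--condition_on_previous_text", "False",
        "--temperature", "0.2",
        "--beam_size", "10",
        "--best_of", "10",
        "--patience", "2",
        "--output_dir", str(output_dir),
        "--output_format", "srt",
    ]
```

Each list can be handed to `subprocess.run(args, check=True)`. One caveat, as far as I know: the reference Whisper only applies `--beam_size` and `--patience` when decoding at temperature 0 and `--best_of` when the temperature is above 0, so with `--temperature 0.2` the beam options may have no effect.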

I found that translating the files myself gave a better end result, at the cost of a little more post-processing.
For instance, Google Docs translate will turn the numbers 21 to 25 into words, messing up the sub indexes.
The post-processing also includes a bunch of regular expressions to clean up each sub entry and trim repetitions.

// Namespaces needed: System, System.Collections.Generic, System.Text.RegularExpressions

class Subtitle
{
    public int Index { get; set; }
    public TimeSpan StartTime { get; set; }
    public TimeSpan EndTime { get; set; }
    public string Text { get; set; }

    // Serialize as one SRT entry: index, time range, text
    public override string ToString()
    {
        return $"{Index}\n{StartTime:hh\\:mm\\:ss\\,fff} --> {EndTime:hh\\:mm\\:ss\\,fff}\n{Text}\n";
    }
}

string TrimRepeatedPhrases(string input)
{
    // Remove repeated sequences of phrases (3 or more repetitions)
    input = Regex.Replace(input, @"(\b\w+\b(?:\s+\w+){0,3})\s+(\1\s*){2,}", "$1", RegexOptions.IgnoreCase);

    // Remove repeated hyphenated syllables or words (e.g., "Ba-ba-ba")
    input = Regex.Replace(input, @"(\b\w+-)(\1){2,}", "$1", RegexOptions.IgnoreCase);

    // Remove repeated sequences of words (3 or more repetitions)
    input = Regex.Replace(input, @"\b(\w+)(?:\s+\1\b){2,}", "$1", RegexOptions.IgnoreCase);

    // Remove sequences of the same word or stutter patterns (like "ah, ah, ah...")
    input = Regex.Replace(input, @"(\b\w+\b)(?:,\s*\1){2,}", "$1");

    // Remove excessive sequences of repeated characters (e.g., "aaaaaaaaaaaaa")
    input = Regex.Replace(input, @"(\w)\1{4,}", "$1");

    // Normalize spaces and trim
    input = Regex.Replace(input, @"\s+", " ");

    return input.Trim();
}

List<Subtitle> FilterRepeatedSubtitles(List<Subtitle> subtitles)
{
    List<Subtitle> filteredSubs = new List<Subtitle>();

    for (int i = 0; i < subtitles.Count; i++)
    {
        // Skip an entry whose text is identical to the previous entry's text
        if (i > 0 && subtitles[i].Text.Trim() == subtitles[i - 1].Text.Trim())
        {
            continue;
        }

        filteredSubs.Add(subtitles[i]);
    }

    return filteredSubs;
}

void RenumberSubtitles(List<Subtitle> subtitles)
{
    for (int i = 0; i < subtitles.Count; i++)
    {
        subtitles[i].Index = i + 1; // Ensure sequential numbering starting from 1
    }
}
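
The snippets above work on subtitles that are already parsed. For the index problem mentioned earlier (a translator turning index numbers into words), one option is to ignore the index line when parsing and renumber on the way out. A rough Python sketch, assuming well-formed timestamp lines; `Cue`, `parse_srt` and `write_srt` are illustrative names, not part of any library:

```python
import re
from dataclasses import dataclass

TIME_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})"
)


@dataclass
class Cue:
    start: str
    end: str
    text: str


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text, keying on the timestamp line and ignoring the index line."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIME_RE.search(line)
            if match:
                # Everything after the timestamp line is the subtitle text
                text = "\n".join(lines[i + 1:]).strip()
                cues.append(Cue(match.group(1), match.group(2), text))
                break
    return cues


def write_srt(cues: list[Cue]) -> str:
    """Serialize cues with fresh sequential indexes starting at 1."""
    return "\n".join(
        f"{n}\n{c.start} --> {c.end}\n{c.text}\n" for n, c in enumerate(cues, 1)
    )
```

Because the index is regenerated, an entry whose index came back as "twenty-one" still parses, as long as its timestamp line survived the translation.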


I can't say it's perfect, but with the Whisper params up there, the output is already not bad at all; the post-process doesn't have much to do.
Of course, there's a little manual work involved because I don't use direct translation. I tried to use a headless browser in a Node.js module to upload the files to Google Docs translate. I couldn't make the translation happen successfully and ended up with the original file being downloaded. I don't have much time to test this more for now...
In any case, the end result is 10 times better than almost any subtitle I've downloaded so far. It's in sync, and nothing's missing. Even if the translation is not always perfect, it's good enough for the most part.

This reminds me a bit of the prehistoric days of electronic cigarettes when we tried to make wicks out of porous stones or stainless steel sheets for artisanal mechanical devices that we imported at great expense.
I'm pretty sure things will evolve just as quickly here. It's already a miracle that we can even extract subtitles ourselves.
 

mei2

Well-Known Member
Dec 6, 2018
243
404
Great post. I especially like your approach of using Google Docs for translation rather than the Google Translate API. I'm guessing Docs produces better translations because it uses context.

Just to share some of my learnings, because I too started with low VRAM (a 4GB 1050 Ti): I started with the base model, then noticed that I was happier with the results from the medium model. Then I developed an appetite for the quantized large model :) , and that's where I am now.

One approach (I haven't adopted it yet) is to use the base model to identify the speech segments for accurate timestamps, then use the large model to transcribe those segments. The newest Whisper version has an option, --clip_timestamps, that can be used to pass it the exact speech segments to transcribe.
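
A sketch of how the hand-off between the two passes could look, assuming the first pass yields (start, end) pairs in seconds (for example from the `start` and `end` fields of Whisper's segments) and that `--clip_timestamps` takes a comma-separated `start,end,start,end,...` list. The padding and gap values are arbitrary guesses and the function name is made up:

```python
def to_clip_timestamps(segments, pad=0.2, max_gap=1.0):
    """Merge nearby (start, end) segments and format them as 'start,end,start,end,...'."""
    merged = []
    for start, end in sorted(segments):
        start = max(0.0, start - pad)  # widen each segment a little
        end = end + pad
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous clip: extend it instead of adding a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return ",".join(f"{s:.2f},{e:.2f}" for s, e in merged)
```

The resulting string would then be passed to the second, larger-model run as the value of `--clip_timestamps`.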

For getting more accurate timestamps, the --word_timestamps option is quite helpful.

For filtering out rubbish results, have you considered adding a filter list or dictionary? Feel free to use the filter list in my repo: github dot com /meizhong986/WhisperJAV/tree/main/notebook. The list is assembled almost entirely from this community. I have tried to maintain it.
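
To illustrate the idea, here is a minimal filter-list sketch. It assumes a plain list with one phrase per line, which may not match the actual file format in the repo, and the normalization rules are my own guess at what makes near-identical phrases compare equal:

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold character width and case, then trim surrounding punctuation."""
    text = unicodedata.normalize("NFKC", text).casefold().strip()
    return text.strip(" .!?。！？…")


def load_filter_list(lines) -> set[str]:
    """Build the filter set from an iterable of lines (assumed: one phrase per line)."""
    return {n for n in map(normalize, lines) if n}


def drop_filtered(texts, filter_set):
    """Keep only the subtitle texts whose normalized form is not in the filter set."""
    return [t for t in texts if normalize(t) not in filter_set]
```

Exact matching on the whole entry is deliberately conservative: a real line of dialogue that merely contains a filtered phrase is kept.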
 

Joker6969

Member
Sep 3, 2014
75
41
I think [ i ] was interpreted as italics here
 

Elmerbs

New Member
Feb 20, 2024
2
4
Hey thx.
I'm not using the API because I don't want to be dependent on some key and all that goes with it. I'd also probably max out the API free use very quickly.
I did manage to automate translation through the regular online interface with Playwright. That means I could eventually split each sub file somewhere before I hit the 5000-character limit and rebuild it from the translated segments. There are challenges in minimizing the number of calls and characters you send for translation... I tried stripping the timestamps and indexes and sending just paragraphs, with each line being a subtitle, but Google's output didn't respect the line breaks. Sending whole chunks under 5000 characters worked well.
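
The chunking step can be sketched like this in Python. It assumes the file is already split into whole SRT entries (index, timestamps and text kept together); the 5000 limit is the figure from the paragraph above and the function name is made up:

```python
def chunk_blocks(blocks, limit=5000, sep="\n\n"):
    """Group whole SRT entries into chunks whose joined length stays within limit."""
    chunks, current, size = [], [], 0
    for block in blocks:
        extra = len(block) + (len(sep) if current else 0)
        if current and size + extra > limit:
            # Adding this entry would overflow: close the current chunk first
            chunks.append(sep.join(current))
            current, size = [], 0
            extra = len(block)
        current.append(block)
        size += extra
    if current:
        chunks.append(sep.join(current))
    return chunks
```

Entries are never split in the middle, so joining the translated chunks with the same separator rebuilds the file. A single entry longer than the limit would still come out as one oversized chunk.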

I'd rather use Google Docs, though. Since it only accepts documents, I convert the srt files to docx as part of my batch. For now, I drag them manually into Google Docs translate; it takes two minutes to process 10 files, so it's not the end of the world. If I can't manage to automate that, I'll fall back on the first option mentioned above.
Does Google Docs use the whole document as context for its translation? I'm not sure; I think the only context it uses is within a sentence, but I may be wrong.

So you're actually running 2 passes on the audio! That must take some time. It's true that with the small model the timestamps seem mostly accurate, though I have nothing to compare them with. But you still get the 30-second "ahh", "ugh" or "u", and some other unusually long time spans, even after you've cleaned up the hallucinations.
I haven't delved into it yet, but have you thought about just adding a filter to your ffmpeg command to reduce ambient noise? Maybe that would do the trick and spare you a pass...
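
As a starting point for that experiment, here is an untested sketch of what such a command could look like. It chains ffmpeg's `highpass` and `afftdn` (FFT denoiser) audio filters; the cutoff value and the choice of filters are guesses, not tuned settings, and the helper name is made up:

```python
def ffmpeg_denoise_args(src, dst, filters=("highpass=f=100", "afftdn")):
    """Build an ffmpeg command that drops the video and applies an audio filter chain."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                     # audio only
        "-af", ",".join(filters),  # filters run left to right
        dst,
    ]
```

Note that applying any filter forces a re-encode, so this cannot be combined with copying the original audio stream as-is.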

My next step will be to try other versions of whisper when I have the time. Any suggestion relevant to 2024 is welcome.

I read that someone mentioned wav files are much faster for Whisper to process. I'm not sure whether that was in the context of some mod of Whisper or of any version.
Can anyone confirm this please?
On my end, I keep the exact same codec when extracting the audio (usually aac) so as not to waste time and resources on conversion. But if I gain 10 min on the conversion and lose 30 min in the Whisper process, I may reconsider!
 

SamKook

Grand Wizard
Staff member
Super Moderator
Uploader
May 10, 2009
3,707
5,097
Whisper only works with wav files internally so it always converts the audio from whatever you feed to it. Audio conversion is quick and doing it before or during is going to take the same time, unless you want to save time specifically during whisper execution.
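
To make the two options concrete, here is a small sketch of both extraction commands. As far as I know, the reference Whisper decodes its input through ffmpeg to 16 kHz mono 16-bit PCM, so the WAV variant below targets that format; the function name and the `to_wav` switch are made up for illustration:

```python
def extract_audio_args(src, dst, to_wav=True):
    """Build an ffmpeg command that extracts the audio track from a video file."""
    args = ["ffmpeg", "-y", "-i", src, "-vn"]
    if to_wav:
        # Convert up front to 16 kHz mono 16-bit PCM
        args += ["-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le"]
    else:
        # Keep the original codec (e.g. aac) without re-encoding
        args += ["-c:a", "copy"]
    return args + [dst]
```

Either way the decoding work happens once; the only difference is whether it is paid before Whisper starts or inside it.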
 

panop857

Active Member
Sep 11, 2011
164
231
Have people found other checkpoints/models to subtitle JAV, other than the regular OpenAI large models? There are bound to be better ones out there if the entire model capacity is used for JP->EN instead of being spread across every language. I don't think any HuggingFace models are openly "good for JAV" for obvious reasons.
 

mei2

Well-Known Member
Dec 6, 2018
243
404
You probably want to check this one out: Kotoba JP_EN Bilingual.
I haven't had time to play with it yet. Let me know what you find out.