Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

Thanks for putting in the work with this. It's been really easy to use and fast to generate for multiple files. Obviously it's not perfect for everything, but it's been pretty good for the interview portion of "amateur" woman off the street vids.
 
Guys, I'm having this problem. Does anyone know the reason?


RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

I'm also getting another error while trying to connect to the GPU. Is that related to membership? I thought it was free.
Unable to connect to GPU backend
You cannot connect to the GPU at this time due to usage limits in Colab. Get information
For more GPU access, you can purchase a Colab processing unit with Pay As You Go.
I am having the same issue, did you ever find a way to resolve the CUDA runtime error?
 
My two cents for the programmers out there.
I cloned the regular Whisper and created a Python 3.9 environment with all the dependencies, so as not to mess with my global Python.
I scrape movies into folders as part of my download routine.
I made a batch in LINQPad that goes over the folders that don't contain subs.
From each folder, I extract the audio from the movie with ffmpeg.NET and launch a new Python process with:
Arguments = $"-m whisper \"{inputFilePath}\" --model small --language Japanese --fp16 False --condition_on_previous_text False --temperature 0.2 --beam_size 10 --best_of 10 --patience 2 --output_dir \"{outputDir}\" --output_format srt"
I can't say it's fast, but you'd be surprised how well the small model performs. I only have a 6GB NVIDIA card, so I can't do much more with the regular Whisper anyway. The batch is pretty straightforward and runs in the background, so it doesn't bother me.
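
For anyone not on LINQPad, the same loop can be sketched in Python. This is only an illustration under assumptions: the one-folder-per-movie layout and the helper names `folders_without_subs`, `whisper_args` and `transcribe` are made up here, and the audio-extraction step is left out. The Whisper flags are copied from the command above.

```python
import subprocess
import sys
from pathlib import Path


def folders_without_subs(root):
    """Yield sub-folders of root that do not contain any .srt file yet."""
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir() and not any(folder.glob("*.srt")):
            yield folder


def whisper_args(input_path, output_dir):
    """Build the argument list for `python -m whisper` (flags from the post)."""
    return [
        "-m", "whisper", str(input_path),
        "--model", "small",
        "--language", "Japanese",
        "--fp16", "False",
        "--condition_on_previous_text", "False",
        "--temperature", "0.2",
        "--beam_size", "10",
        "--best_of", "10",
        "--patience", "2",
        "--output_dir", str(output_dir),
        "--output_format", "srt",
    ]


def transcribe(input_path, output_dir):
    """Run Whisper in a child process with the current Python interpreter."""
    subprocess.run([sys.executable, *whisper_args(input_path, output_dir)], check=True)
```

Passing the arguments as a list avoids the manual quoting of paths that the single command string needs.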

I found that translating the files myself gave a better end result, at the cost of a little more post-processing.
For instance, Google Doc translate will translate numbers from 21 to 25 into words, messing up the sub indexes.
The post-processing also includes a bunch of regular expressions to clean each sub entry and trim repetitions.

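using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One SRT entry: sequential index, timing, and text.
class Subtitle
{
    public int Index { get; set; }
    public TimeSpan StartTime { get; set; }
    public TimeSpan EndTime { get; set; }
    public string Text { get; set; }

    // Renders the entry in SRT layout: index, "hh:mm:ss,fff --> hh:mm:ss,fff", text.
    public override string ToString()
    {
        return $"{Index}\n{StartTime:hh\\:mm\\:ss\\,fff} --> {EndTime:hh\\:mm\\:ss\\,fff}\n{Text}\n";
    }
}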


string TrimRepeatedPhrases(string input)
{
    // Collapse a phrase of 1 to 4 words that is repeated 3 or more times
    input = Regex.Replace(input, @"(\b\w+\b(?:\s+\w+){0,3})\s+(\1\s*){2,}", "$1", RegexOptions.IgnoreCase);

    // Collapse repeated hyphenated syllables or words (e.g., "Ba-ba-ba-ba")
    input = Regex.Replace(input, @"(\b\w+-)(\1){2,}", "$1", RegexOptions.IgnoreCase);

    // Collapse a single word repeated 3 or more times, separated by spaces
    input = Regex.Replace(input, @"\b(\w+)(?:\s+\1\b){2,}", "$1", RegexOptions.IgnoreCase);

    // Collapse comma-separated repeats of the same word (stutters like "ah, ah, ah...")
    input = Regex.Replace(input, @"(\b\w+\b)(?:,\s*\1){2,}", "$1");

    // Collapse runs of 5 or more identical characters (e.g., "aaaaaaaaaaaaa")
    input = Regex.Replace(input, @"(\w)\1{4,}", "$1");

    // Normalize whitespace and trim
    input = Regex.Replace(input, @"\s+", " ").Trim();

    return input;
}

// Drops an entry when its text is identical to the entry just before it.
List<Subtitle> FilterRepeatedSubtitles(List<Subtitle> subtitles)
{
    List<Subtitle> filteredSubs = new List<Subtitle>();

    for (int i = 0; i < subtitles.Count; i++)
    {
        if (i > 0 && subtitles[i].Text.Trim() == subtitles[i - 1].Text.Trim())
        {
            continue;
        }

        filteredSubs.Add(subtitles[i]);
    }

    return filteredSubs;
}

void RenumberSubtitles(List<Subtitle> subtitles)
{
    for (int i = 0; i < subtitles.Count; i++)
    {
        subtitles[i].Index = i + 1; // Ensure sequential numbering starting from 1
    }
}
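
The index damage mentioned earlier (numbers turned into words by the translator) can also be repaired without trusting the index line at all. A minimal sketch in Python, assuming the timing lines survive translation unchanged; `renumber_srt` is a hypothetical helper for illustration, not part of the batch described in this post:

```python
import re

# An SRT timing line, e.g. "00:01:02,345 --> 00:01:04,000"
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def renumber_srt(text):
    """Rebuild SRT indexes from block order, ignoring whatever the index line says."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for pos, line in enumerate(lines):
            if TIMING.search(line):
                break
        else:
            continue  # no timing line found: not a subtitle block, skip it
        # Keep the timing line and the text, replace everything before it.
        entries.append("\n".join([str(len(entries) + 1)] + lines[pos:]))
    return "\n\n".join(entries) + "\n"
```

Blocks are located by their timing line, so an index that came back as "twenty-one" is simply discarded and replaced by the block's position.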


I can't say it's perfect, but with the Whisper params up there, the output is already not bad at all; the post-process doesn't have much to do.
Of course, there's a little manual work involved because I don't use direct translation. I tried to use a headless browser in a Node.js module to upload the files to Google Doc translate. I couldn't make the translation happen successfully and ended up with the original file being downloaded. I don't have much time to test this more for now...
In any case, the end result is 10 times better than almost any subtitle I've downloaded so far. It's in sync, and nothing's missing. Even if the translation is not always perfect, it's good enough for the most part.

This reminds me a bit of the prehistoric days of electronic cigarettes when we tried to make wicks out of porous stones or stainless steel sheets for artisanal mechanical devices that we imported at great expense.
I'm pretty sure things will evolve just as quickly here. It's already a miracle that we can even extract subtitles ourselves.
 

Great post. I especially like your approach of using Google Docs for translation rather than the Google Translate APIs. I'm guessing Docs will produce better translations, since I'm guessing it uses context.

Just to share some of my learnings with you, because I too started from low VRAM (4GB 1050 Ti). I started from the base model, then I noticed that I was happier with the results from the medium model. Then I started having an appetite for the quantized large model :) , and that's where I am now.

One approach (I haven't adopted it yet) is to use the base model to identify the speech segments for accurate timestamps, then use the large model to transcribe those segments. The newest Whisper version has an option, --clip_timestamps, that can be used to pass it the exact speech segments to transcribe.
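
A rough sketch in Python of the second half of that idea: turning detected speech segments into a value for --clip_timestamps. As far as I know, the option takes a comma-separated start,end,start,end,... list in seconds. The segment list, the `gap` threshold and both helper names are assumptions for illustration only.

```python
def merge_segments(segments, gap=0.5):
    """Merge (start, end) segments that are separated by at most `gap` seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap:
            # Close enough to the previous segment: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def clip_timestamps_arg(segments):
    """Format segments as 'start,end,start,end,...' in seconds."""
    return ",".join(f"{t:.2f}" for start, end in segments for t in (start, end))
```

Merging nearby segments first keeps the list short and avoids cutting a sentence at a brief pause.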

For getting more accurate timestamps, the option --word_timestamps is quite helpful.

For filtering out rubbish results, have you considered adding a filter list or dictionary? Feel free to use the filter list in my repo: github dot com /meizhong986/WhisperJAV/tree/main/notebook. The list is assembled almost entirely from this community. I have tried to maintain it.
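
How such a filter list might be applied, sketched in Python. The phrases in the usage comment are placeholders, not entries from the repo's list, and `normalize` and `drop_filtered` are hypothetical helper names.

```python
def normalize(text):
    """Lower-case, collapse whitespace and strip edge punctuation for comparison."""
    return " ".join(text.lower().split()).strip(" .!?,")


def drop_filtered(entries, filter_phrases):
    """Return the subtitle texts that do not match any phrase in the filter list."""
    blocked = {normalize(phrase) for phrase in filter_phrases}
    return [entry for entry in entries if normalize(entry) not in blocked]


# Example with placeholder phrases:
# drop_filtered(["Thank you for watching!", "Hello there"], ["thank you for watching"])
```

This only removes entries whose whole text matches a listed phrase; partial matches are left alone, so real dialogue that happens to contain one is not lost.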
 
I think the forum interprets [ i ] as an italics tag, so indexers like subtitles[ i ] get swallowed when code is posted.
 
Hey, thanks.
I'm not using the API because I don't want to be dependent on a key and all that goes with it. I'd also probably max out the API's free tier very quickly.
I did manage to automate translation from the regular online interface with Playwright. That means I could eventually split each sub file somewhere before I hit the 5000-character limit and rebuild it from the translated segments. There are challenges in minimizing the number of calls and characters you send for translation... I tried stripping the timestamps and indexes and sending just paragraphs, each line being a subtitle, but Google's output didn't respect the line breaks. Sending whole chunks under 5000 characters worked well.
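
The chunking step could look roughly like this in Python. It is a sketch under assumptions: the input is a list of complete SRT blocks (index, timing and text), the 5000-character limit is the one mentioned above, and `chunk_blocks` is a hypothetical helper name.

```python
def chunk_blocks(blocks, limit=5000, sep="\n\n"):
    """Group whole subtitle blocks into chunks of at most `limit` characters."""
    chunks, current, size = [], [], 0
    for block in blocks:
        extra = len(block) + (len(sep) if current else 0)
        if current and size + extra > limit:
            # Adding this block would overflow: close the current chunk first.
            chunks.append(sep.join(current))
            current, size = [], 0
            extra = len(block)
        current.append(block)
        size += extra
    if current:
        chunks.append(sep.join(current))
    return chunks
```

Blocks are never split, so a single block longer than the limit ends up alone in an oversized chunk. Joining the translated chunks with the same separator rebuilds the file.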

I'd rather use Google Doc though. Since it only takes documents, I convert the srt to docx as part of my batch. For now, I drag them manually into Google Doc translate; it takes two minutes to process 10 files, so it's not the end of the world. If I can't manage to automate that, I'll fall back on the first option mentioned above.
Does Google Doc use the whole document as context for its translation? I'm not sure; I think the only context it uses is within a sentence, but I may be wrong.

So you're actually running 2 passes on the audio! That must take some time. It's true that with the small model the timestamps seem mostly accurate, though I have nothing to compare them with. But you get the 30-second "ahh", "ugh" or "u", and some other unusually long time spans as well, even after you've cleaned up the hallucinations.
I haven't delved into it yet, but have you thought about just adding a filter to your ffmpeg command to reduce ambient noise? Maybe that would do the trick and spare you a pass...
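
As a starting point for that experiment, here is a Python sketch that builds an ffmpeg command with an audio filter chain. The choice of `highpass` and `afftdn` and the 100 Hz cutoff are guesses, not tested settings, and `denoise_args` is a hypothetical helper name. Note that filtering forces a re-encode, so it cannot be combined with copying the original audio stream.

```python
def denoise_args(src, dst, filters=("highpass=f=100", "afftdn")):
    """Build an ffmpeg command that drops the video and applies audio filters."""
    return [
        "ffmpeg", "-nostdin",
        "-i", str(src),
        "-vn",                    # audio only
        "-af", ",".join(filters),  # filter chain, applied left to right
        str(dst),
    ]
```

The list can be passed to `subprocess.run(..., check=True)`; whether the filtered audio actually reduces hallucinations would have to be compared against an unfiltered run.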

My next step will be to try other versions of Whisper when I have the time. Any suggestion relevant to 2024 is welcome.

I read someone mention that wav files are much faster for Whisper to process. I'm not sure if that was in the context of another mod of Whisper or of any version.
Can anyone confirm this, please?
On my end, I keep the exact same codec when extracting the audio (usually aac), so as not to waste time and resources on conversion. But if I gain 10 min on the conversion and lose 30 min in the Whisper process, I may reconsider!
 
Whisper only works with wav files internally, so it always converts the audio from whatever you feed it. Audio conversion is quick, and doing it before or during is going to take the same time, unless you specifically want to save time during Whisper execution.
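
If you do want to move that conversion out of the Whisper run, a Python sketch of the pre-conversion step follows. To my knowledge the reference Whisper decodes its input through ffmpeg to 16 kHz mono 16-bit PCM, so the command mirrors that; treat the exact parameters as an assumption, and `to_wav_args` as a hypothetical helper name.

```python
def to_wav_args(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes mono 16-bit PCM wav at `sample_rate` Hz."""
    return [
        "ffmpeg", "-nostdin", "-y",
        "-i", str(src),
        "-vn",                  # drop any video stream
        "-ac", "1",             # mono
        "-ar", str(sample_rate),
        "-c:a", "pcm_s16le",    # 16-bit PCM
        str(dst),
    ]
```

Uncompressed wav is much larger than aac, so this trades disk space for a slightly shorter Whisper step.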
 
Have people found other checkpoints/models to subtitle JAV, other than the regular OpenAI large models? There are bound to be better ones out there if the entire model capacity is used for JP->EN instead of being spread across every language. I don't think any HuggingFace models are openly "good for JAV", for obvious reasons.
 
You probably want to check this one out: Kotoba JP_EN Bilingual.
I haven't had time to play with it yet. Let me know what you find out.
 
I think the value in that model is that it can translate English to Japanese instead of Japanese to English. The English subtitles are worse than Large-v3, and have bad formatting issues.
 
Hi, I'm new to this, and yesterday I tried making subs in Whisper v07b using audio converted with VLC.

I ran into a problem: the ZIP file is empty every time I run it. There is also no srt file to be found in my Drive folder.

Every step runs successfully, and Whisper seems to know which file is loaded from the Drive, but after every run I get an empty ZIP file and no SRT.

Does anyone know what might be the problem?
 
You're referring to WhisperJAV, right? Some of the usual suspects are:
-- check that you're using the latest fixes. 07b had a minor fix a couple of days ago.
-- double check that you're not getting the "your session has crashed" pop-up message from Colab.
-- check the audio file extensions. Accepted extensions are: ".mp3", ".wav", ".aac", ".m4a", ".ogg", ".flac".
-- check that the audio files are uploaded correctly (say, listen to them or check their sizes).
-- check that your folder name is exactly WhisperJAV (case sensitive).

If it's none of the above, send me a screenshot of the last screen output. I'll check it.
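
The extension and size checks from the list above can be run locally before uploading. A small Python sketch: the accepted extensions are the ones listed, while the case-insensitive comparison and the `check_audio_file` helper are assumptions, not WhisperJAV behaviour.

```python
from pathlib import Path

ACCEPTED = {".mp3", ".wav", ".aac", ".m4a", ".ogg", ".flac"}


def check_audio_file(path):
    """Return a list of problems found for one audio file (empty list means OK)."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in ACCEPTED:
        problems.append(f"extension {p.suffix!r} is not in the accepted list")
    if not p.exists():
        problems.append("file does not exist")
    elif p.stat().st_size == 0:
        problems.append("file is empty")
    return problems
```

A converted file that kept its video extension (for example .mp4) is reported by the first check.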
 
Thank you so much for the input.

I tried running WhisperJAV just now and I got the result. My only mistake was not renaming my file after converting the video to audio. VLC did not change the extension of the audio file, so Google Drive uploaded the file with its original video extension (.mp4).

By any chance, does the audio format matter? I tried using mp3, and it only generated around 400 lines. Some lines weren't captured.
Should I try a better audio format (like .flac)? Or can I tinker with some numbers in the settings on the Google Colab page?
 
What's your plan for updating the existing Colabs, especially WhisperwithVADpro? I've been using it since its inception and it works wonders.
 
Can you share the Google Colab link for WhisperwithVADpro? I'm assuming it's a better version of this.


This is the link for WhisperwithVADpro. There are a bunch of new settings that can be tweaked to get better results. For the default settings, you may refer to or use the notes at the bottom of it. Keep in mind that it may take longer to transcribe the audio.
 
Credit goes to @anon_entity, the original developer of WhisperVAD. He put together a robust algorithm.
I was not planning any update to that Colab. But if there are any ideas for improvements or new features, I'm happy to look into them.
 


Thanks! For some reason I couldn't find this on google