Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

montblanc5

New Member
Aug 25, 2024
7
5
Thanks for putting in the work with this. It's been really easy to use and fast to generate for multiple files. Obviously it's not perfect for everything, but it's been pretty good for the interview portion of "amateur" woman off the street vids.
 

Goldenpope

New Member
Jun 26, 2018
5
0
Guys, I'm having this problem. Does anyone know the reason?


RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

I'm also getting another error while trying to connect to the GPU. Is that related to membership? I thought it was free.
Unable to connect to GPU backend
You cannot connect to the GPU at this time due to usage limits in Colab. Get information
For more GPU access, you can purchase a Colab processing unit with Pay As You Go.
I am having the same issue. Did you ever find a way to resolve the CUDA runtime error?
 

Elmerbs

New Member
Feb 20, 2024
1
3
My two cents for the programmers out there.
I cloned the regular Whisper and created a Python 3.9 environment with all the dependencies so as not to mess with my global Python.
I scrape movies into folders as part of my download routine.
I made a batch in LINQPad that goes over the folders that don't contain subs.
From each folder, I extract the audio from the movie with ffmpeg.NET, and launch a new Python process with:
Arguments = $"-m whisper \"{inputFilePath}\" --model small --language Japanese --fp16 False --condition_on_previous_text False --temperature 0.2 --beam_size 10 --best_of 10 --patience 2 --output_dir \"{outputDir}\" --output_format srt"
I can't say it's fast, but you'd be surprised how well the small model performs. I only have a 6GB NVIDIA card, so I can't do much more with the regular Whisper anyway. The batch is pretty straightforward and runs in the background, so it doesn't bother me.
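
For anyone who would rather keep the batch part in Python too, here is a minimal sketch of the same loop. It is not the LINQPad script: the folder layout, the extension list and the helper names (folders_missing_subs, whisper_command, run_batch) are assumptions made up for illustration, and only the Whisper flags are copied from the command above. It also hands the video file straight to Whisper instead of extracting the audio first, on the assumption that ffmpeg is on the PATH.

```python
import subprocess
import sys
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".avi"}  # assumed list; adjust to your library


def folders_missing_subs(root):
    """Yield (folder, video) for each sub-folder holding a video but no .srt."""
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        if any(f.suffix.lower() == ".srt" for f in files):
            continue  # this folder already has subtitles
        videos = sorted(f for f in files if f.suffix.lower() in VIDEO_EXTS)
        if videos:
            yield folder, videos[0]


def whisper_command(input_path, output_dir):
    """Build the argument list for the Whisper CLI with the flags from the post."""
    return [
        sys.executable, "-m", "whisper", str(input_path),
        "--model", "small",
        "--language", "Japanese",
        "--fp16", "False",
        "--condition_on_previous_text", "False",
        "--temperature", "0.2",
        "--beam_size", "10",
        "--best_of", "10",
        "--patience", "2",
        "--output_dir", str(output_dir),
        "--output_format", "srt",
    ]


def run_batch(root):
    """Run Whisper once per folder that has no subtitles yet."""
    for folder, video in folders_missing_subs(root):
        subprocess.run(whisper_command(video, folder), check=True)
```

Calling run_batch on the top-level download folder would process every folder that lacks an .srt file, one movie at a time.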

I found that translating the files myself gave a better end result, at the cost of a little more post-processing.
For instance, Google Doc translate will translate the numbers from 21 to 25 into words, messing up the subtitle indexes.
The post-processing also includes a bunch of regular expressions to clean up each subtitle entry and trim repetitions.

// Needs the System, System.Collections.Generic and System.Text.RegularExpressions namespaces.

// One SRT entry: index, time range and text.
class Subtitle
{
    public int Index { get; set; }
    public TimeSpan StartTime { get; set; }
    public TimeSpan EndTime { get; set; }
    public string Text { get; set; }

    // Formats the entry as an SRT block (index, "start --> end", text).
    public override string ToString()
    {
        return $"{Index}\n{StartTime:hh\\:mm\\:ss\\,fff} --> {EndTime:hh\\:mm\\:ss\\,fff}\n{Text}\n";
    }
}

// Collapses repetitions inside a single subtitle text.
string TrimRepeatedPhrases(string input)
{
    // Remove repeated sequences of phrases (3 or more repetitions)
    input = Regex.Replace(input, @"(\b\w+\b(?:\s+\w+){0,3})\s+(\1\s*){2,}", "$1", RegexOptions.IgnoreCase);

    // Remove repeated hyphenated syllables or words (e.g., "Ba-ba-ba")
    input = Regex.Replace(input, @"(\b\w+-)(\1){2,}", "$1", RegexOptions.IgnoreCase);

    // Remove repeated sequences of words (3 or more repetitions)
    input = Regex.Replace(input, @"\b(\w+)(?:\s+\1\b){2,}", "$1", RegexOptions.IgnoreCase);

    // Remove sequences of the same word or stutter patterns (like "ah, ah, ah...")
    input = Regex.Replace(input, @"(\b\w+\b)(?:,\s*\1){2,}", "$1");

    // Remove excessive sequences of repeated characters (e.g., "aaaaaaaaaaaaa")
    input = Regex.Replace(input, @"(\w)\1{4,}", "$1");

    // Normalize spaces and trim
    input = Regex.Replace(input, @"\s+", " ").Trim();

    return input;
}

// Drops every entry whose text is identical to the entry right before it.
List<Subtitle> FilterRepeatedSubtitles(List<Subtitle> subtitles)
{
    List<Subtitle> filteredSubs = new List<Subtitle>();

    for (int i = 0; i < subtitles.Count; i++)
    {
        if (i > 0 && subtitles[i].Text.Trim() == subtitles[i - 1].Text.Trim())
        {
            continue;
        }

        filteredSubs.Add(subtitles[i]);
    }

    return filteredSubs;
}

// Rewrites the indexes in place after entries have been removed.
void RenumberSubtitles(List<Subtitle> subtitles)
{
    for (int i = 0; i < subtitles.Count; i++)
    {
        subtitles[i].Index = i + 1; // Ensure sequential numbering starting from 1
    }
}
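
Since Whisper itself lives in Python, the duplicate filter and the renumbering can be sketched there as well. This is a rough equivalent under stated assumptions, not a port of the full post-process: the function names (parse_srt, drop_consecutive_duplicates, to_srt) are made up for illustration, and it assumes the translated file still has intact "start --> end" timing lines. It ignores whatever sits on the index line, so an index that the translator turned into a word does no harm.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Return a list of (start, end, text) tuples, ignoring the index lines."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for pos, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the subtitle text.
                body = " ".join(l.strip() for l in lines[pos + 1:]).strip()
                entries.append((match.group(1), match.group(2), body))
                break
    return entries


def drop_consecutive_duplicates(entries):
    """Drop an entry when its text equals the text of the entry before it."""
    kept = []
    for entry in entries:
        if kept and kept[-1][2] == entry[2]:
            continue
        kept.append(entry)
    return kept


def to_srt(entries):
    """Write the entries back out with fresh indexes starting from 1."""
    blocks = [
        f"{index}\n{start} --> {end}\n{text}\n"
        for index, (start, end, text) in enumerate(entries, 1)
    ]
    return "\n".join(blocks)
```

Chaining the three calls, to_srt(drop_consecutive_duplicates(parse_srt(raw))), gives a file with sequential indexes again, whatever the translator did to the original numbers.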


I can't say it's perfect, but with the Whisper params up there, the output is already not bad at all; the post-process doesn't have much to do.
Of course, there's a little manual work involved because I don't use direct translation. I tried to use a headless browser in a Node.js module to upload the files to Google Doc translate. I couldn't get the translation to happen successfully and ended up with the original file being downloaded. I don't have much time to test this more for now...
In any case, the end result is 10 times better than almost any subtitle I've downloaded so far. It's in sync, and nothing's missing. Even if the translation is not always perfect, it's good enough for the most part.

This reminds me a bit of the prehistoric days of electronic cigarettes when we tried to make wicks out of porous stones or stainless steel sheets for artisanal mechanical devices that we imported at great expense.
I'm pretty sure things will evolve just as quickly here. It's already a miracle that we can even extract subtitles ourselves.

mei2

Well-Known Member
Dec 6, 2018
242
403
Great post. I especially like your approach of using Google Docs for translation rather than the Google Translate APIs. I'm guessing Docs will produce better translations, since it probably uses context.

Just to share some of my learnings with you, because I too started from low VRAM (4 GB 1050 Ti): I started with the base model, then I noticed that I was happier with the results from the medium model. Then I started having an appetite for the quantized large model :) , and that's where I am now.

One approach (I haven't adopted it yet) is to use the base model to identify the speech segments for accurate timestamps, then to use the large model to transcribe those segments. The newest Whisper version has an option, --clip_timestamps, that can be used to pass it the exact speech segments to transcribe.
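
As a rough sketch of the glue between the two passes: --clip_timestamps takes a comma-separated list of start,end,start,end,... values in seconds, so the segments found by the first pass have to be flattened into that shape. The helper name clip_timestamps_arg and the padding value are assumptions for illustration, and where the (start, end) pairs come from (for example the JSON output of the first pass) is left open.

```python
def clip_timestamps_arg(segments, padding=0.2):
    """Flatten [(start, end), ...] in seconds into 'start,end,start,end,...'.

    Each segment is widened by `padding` seconds on both sides, and segments
    that touch or overlap after widening are merged into one clip.
    """
    merged = []
    for start, end in sorted(segments):
        start = max(0.0, start - padding)
        end = end + padding
        if merged and start <= merged[-1][1]:
            # Overlaps the previous clip: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return ",".join(f"{value:.2f}" for pair in merged for value in pair)
```

The returned string would then be passed to the second, larger-model run as the value of --clip_timestamps.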

For getting more accurate timestamps, the option --word_timestamps is quite helpful.

For filtering out rubbish results, have you considered adding a filter list or dictionary? Feel free to use the filter list in my repo: github dot com /meizhong986/WhisperJAV/tree/main/notebook. The list is assembled almost entirely from this community. I have tried to maintain it.
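
A minimal sketch of how such a filter list could be applied, assuming it is a plain list of phrases, one per line. The actual format of the list in the repo is not shown here, the function names (normalize, load_filter_list, drop_filtered) are made up for illustration, and the phrases in the usage note are placeholders rather than entries taken from the repo.

```python
import unicodedata


def normalize(text):
    """Fold width variants and case so near-identical lines compare equal."""
    return unicodedata.normalize("NFKC", text).strip().lower()


def load_filter_list(lines):
    """Build a set of normalized phrases, skipping empty lines."""
    return {normalize(line) for line in lines if line.strip()}


def drop_filtered(subtitle_texts, filter_set):
    """Keep only the subtitle texts that are not on the filter list."""
    return [text for text in subtitle_texts if normalize(text) not in filter_set]
```

For example, with a filter set built from ["Thanks for watching!"], drop_filtered(["Hello", "thanks for watching! "], filter_set) keeps only "Hello". The match is on the whole line, so a real sentence that merely contains a filtered phrase is left alone.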
 

Joker6969

Member
Sep 3, 2014
74
40
I think the [i] indexers in Elmerbs's code (as in subtitles[i].Text) were interpreted as italics tags by the forum and stripped out.