Thanks for putting in the work with this. It's been really easy to use and fast to generate for multiple files. Obviously it's not perfect for everything, but it's been pretty good for the interview portion of "amateur" woman off the street vids.
I am having the same issue. Did you ever find a way to resolve the CUDA runtime error?
Guys, I'm having this problem. Does anyone know the reason?
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version
I'm also getting another error while trying to connect to a GPU. Is that related to membership? I thought it was free.
Unable to connect to GPU backend
You cannot connect to the GPU at this time due to usage limits in Colab. Get information
For more GPU access, you can purchase a Colab processing unit with Pay As You Go.
Arguments = $"-m whisper \"{inputFilePath}\" --model small --language Japanese --fp16 False --condition_on_previous_text False --temperature 0.2 --beam_size 10 --best_of 10 --patience 2 --output_dir \"{outputDir}\" --output_format srt"
// Requires the System, System.Collections.Generic, and System.Text.RegularExpressions namespaces.
class Subtitle
{
    public int Index { get; set; }
    public TimeSpan StartTime { get; set; }
    public TimeSpan EndTime { get; set; }
    public string Text { get; set; }

    // Format the entry as an SRT block: index, timing line, text.
    public override string ToString()
    {
        return $"{Index}\n{StartTime:hh\\:mm\\:ss\\,fff} --> {EndTime:hh\\:mm\\:ss\\,fff}\n{Text}\n";
    }
}

string TrimRepeatedPhrases(string input)
{
    // Remove repeated sequences of phrases (3 or more repetitions)
    input = Regex.Replace(input, @"(\b\w+\b(?:\s+\w+){0,3})\s+(\1\s*){2,}", "$1", RegexOptions.IgnoreCase);
    // Remove repeated hyphenated syllables or words (e.g., "Ba-ba-ba")
    input = Regex.Replace(input, @"(\b\w+-)(\1){2,}", "$1", RegexOptions.IgnoreCase);
    // Remove repeated single words (3 or more repetitions)
    input = Regex.Replace(input, @"\b(\w+)(?:\s+\1\b){2,}", "$1", RegexOptions.IgnoreCase);
    // Remove comma-separated stutter patterns (like "ah, ah, ah...")
    input = Regex.Replace(input, @"(\b\w+\b)(?:,\s*\1){2,}", "$1");
    // Collapse excessive runs of the same character (e.g., "aaaaaaaaaaaaa")
    input = Regex.Replace(input, @"(\w)\1{4,}", "$1");
    // Normalize spaces and trim
    input = Regex.Replace(input, @"\s+", " ").Trim();
    return input;
}

List<Subtitle> FilterRepeatedSubtitles(List<Subtitle> subtitles)
{
    List<Subtitle> filteredSubs = new List<Subtitle>();
    for (int i = 0; i < subtitles.Count; i++)
    {
        // Skip an entry whose text is identical to the previous entry's text
        if (i > 0 && subtitles[i].Text.Trim() == subtitles[i - 1].Text.Trim())
        {
            continue;
        }
        filteredSubs.Add(subtitles[i]);
    }
    return filteredSubs;
}

void RenumberSubtitles(List<Subtitle> subtitles)
{
    for (int i = 0; i < subtitles.Count; i++)
    {
        subtitles[i].Index = i + 1; // Ensure sequential numbering starting from 1
    }
}
I think [i] was interpreted as italics here.
My two cents for the programmers out there.
I cloned the regular Whisper and created a Python 3.9 environment with all the dependencies so as not to mess with my global Python.
I scrape movies into folders as part of my download routine.
I made a batch in LINQPad that goes over the folders that don't contain subs.
From each folder, I extract the audio from the movie with ffmpeg.NET and launch a new Python process with the Whisper arguments shown above.
I can't say it's fast, but you'd be surprised how well the small model performs. I only have a 6 GB NVIDIA card, so I can't do much more with the regular Whisper anyway. The batch is pretty straightforward and runs in the background, so it doesn't bother me.
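For anyone who would rather drive the same batch from Python than from LINQPad, here is a minimal sketch of the two steps described above: finding folders that have a movie but no subs, and building the Whisper command line. The Whisper flags are the ones from the post; the function names, the list of video extensions, and the one-movie-per-folder layout are my assumptions, and the audio extraction step is left out.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: which extensions count as "a movie"; adjust to your library.
VIDEO_EXTS = {".mp4", ".mkv", ".avi"}


def folders_without_subs(root: Path) -> list[Path]:
    """Return the sub-folders of root that hold a video but no .srt file."""
    result = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        has_video = any(f.suffix.lower() in VIDEO_EXTS for f in files)
        has_srt = any(f.suffix.lower() == ".srt" for f in files)
        if has_video and not has_srt:
            result.append(folder)
    return result


def whisper_command(audio_path: Path, output_dir: Path) -> list[str]:
    """Build an argument list mirroring the Arguments string from the post."""
    return [
        sys.executable, "-m", "whisper", str(audio_path),
        "--model", "small",
        "--language", "Japanese",
        "--fp16", "False",
        "--condition_on_previous_text", "False",
        "--temperature", "0.2",
        "--beam_size", "10",
        "--best_of", "10",
        "--patience", "2",
        "--output_dir", str(output_dir),
        "--output_format", "srt",
    ]


def transcribe(audio_path: Path, output_dir: Path) -> None:
    """Run Whisper as a child process; raises if Whisper exits with an error."""
    subprocess.run(whisper_command(audio_path, output_dir), check=True)
```

Passing the arguments as a list avoids the manual quoting of paths that the single Arguments string needs.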
I found that translating the files myself gave a better end result, but it takes a little more post-processing.
For instance, Google Docs translate will turn the numbers from 21 to 25 into words, messing up the subtitle indexes.
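A minimal Python sketch of one way to repair those indexes after translation: throw away whatever sits above each timing line and renumber from 1. It assumes the translator left the timing lines themselves intact and that cues are still separated by blank lines; `renumber_srt` is a hypothetical helper name, not part of any tool mentioned here.

```python
import re

# An SRT timing line, e.g. "00:00:01,000 --> 00:00:02,000".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def renumber_srt(srt_text: str) -> str:
    """Rewrite every cue index sequentially from 1, trusting only the timing line."""
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    fixed = []
    for block in blocks:
        lines = block.splitlines()
        # Whatever precedes the timing line is the (possibly corrupted) index.
        timing_at = next((i for i, ln in enumerate(lines) if TIMING.search(ln)), None)
        if timing_at is None:
            continue  # no timing line, so this block is not a cue
        body = lines[timing_at:]
        fixed.append("\n".join([str(len(fixed) + 1)] + body))
    return "\n\n".join(fixed) + "\n"
```

With this, a cue whose index came back as "twenty-one" simply gets its position in the file as its new number.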
The post-processing also includes a bunch of regular expressions to clean each subtitle entry and trim repetitions.
I can't say it's perfect, but with the Whisper params up there, the output is already not bad at all; the post-process doesn't have much to do.
Of course, there's a little manual work involved because I don't use direct translation. I tried to use a headless browser in a Node.js module to upload the files to Google Docs translate. I couldn't make the translation happen successfully and ended up with the original file being downloaded. I don't have much time to test this more for now...
In any case, the end result is 10 times better than almost any subtitle I've downloaded so far. It's in sync, and nothing's missing. Even if the translation is not always perfect, it's good enough for the most part.
This reminds me a bit of the prehistoric days of electronic cigarettes when we tried to make wicks out of porous stones or stainless steel sheets for artisanal mechanical devices that we imported at great expense.
I'm pretty sure things will evolve just as quickly here. It's already a miracle that we can even extract subtitles ourselves.
Hey, thanks. Great post. I especially like your approach of using Google Docs for translation rather than the Google Translate APIs. I'm guessing Docs produces better translations because it uses context.
Just to share some of what I've learned, because I too started with low VRAM (4 GB 1050 Ti): I started with the basic model, then noticed that I was happier with the results from the medium model. Then I developed an appetite for the quantized large model, and that's where I am now.
One approach (I haven't adopted it yet) is to use the basic model to identify the speech segments for accurate timestamps, then use the large model to transcribe those segments. The newest Whisper version has an option, --clip_timestamps, that can be used to pass it the exact speech segments to transcribe.
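As a rough sketch of the glue between the two passes: --clip_timestamps takes a comma-separated list of start,end,start,end,... values in seconds, so the speech segments from the first pass need to be merged and flattened into that form. The helper name `to_clip_timestamps` and the optional padding are my own assumptions; where the (start, end) pairs come from is up to your first pass.

```python
def to_clip_timestamps(segments, padding=0.0):
    """Merge overlapping (start, end) segments and format them as 'start,end,start,end,...'.

    segments: iterable of (start_seconds, end_seconds) pairs from the first pass.
    padding:  extra seconds added on both sides of every segment (assumption: optional).
    """
    merged = []
    for start, end in sorted(segments):
        start = max(0.0, start - padding)
        end = end + padding
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous clip: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return ",".join(f"{t:.2f}" for clip in merged for t in clip)
```

The resulting string can then be passed as the value of --clip_timestamps for the large-model run.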
For more accurate timestamps, the --word_timestamps option is quite helpful.
For filtering out rubbish results, have you considered adding a filter list or dictionary? Feel free to use the filter list in my repo: github dot com /meizhong986/WhisperJAV/tree/main/notebook. The list is assembled almost entirely from this community, and I have tried to maintain it.
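A minimal sketch of how such a filter list could be applied, assuming the list is plain text with one phrase per line and that a cue is dropped only when its whole text matches an entry. The actual file format in the repo may differ, and the function names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold character width and case, then drop surrounding spaces and punctuation."""
    text = unicodedata.normalize("NFKC", text).casefold().strip()
    return text.strip(" .,!?。、")


def load_filter_list(lines) -> set:
    """Build the set of banned phrases from an iterable of lines, ignoring blank ones."""
    return {normalize(line) for line in lines if line.strip()}


def drop_filtered(cues, banned):
    """Keep only the cue texts that are not on the filter list."""
    return [cue for cue in cues if normalize(cue) not in banned]
```

For example, with a list containing the (illustrative) entry "Thanks for watching!", the cue "Thanks for watching." is dropped while "Hello there" is kept.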
Have people found other checkpoints/models for subtitling JAV, other than the regular OpenAI Large models? There should be better ones out there if the entire model space is used for JP->EN instead of being spread across every language. I don't think any HuggingFace models are openly "good for JAV" for obvious reasons.
I think the value in that model is that it can translate English to Japanese instead of Japanese to English. The English subtitles are worse than Large-v3 and have bad formatting issues.
You probably want to check this one out: Kotoba JP_EN Bilingual.
I haven't had time to play with it yet. Let me know what you find out.
Hi, I'm new to this, and yesterday I tried making subs in Whisper v07b using audio converted with VLC.
I encountered a problem: the ZIP file is empty every time I run it. There is also no SRT file to be found in my Drive folder.
Every step runs successfully, and it seems Whisper knows which file is loaded into the Drive, but after every run I get an empty ZIP file and no SRT.
Does anyone know what might be the problem?
You're referring to WhisperJAV, right? Some of the usual suspects are:
-- check that you're using the latest fixes. 07b had a minor fix a couple of days ago.
-- double-check that you're not getting a "Your session has crashed" pop-up message from Colab.
-- check the audio file extensions. Accepted extensions are: ".mp3", ".wav", ".aac", ".m4a", ".ogg", ".flac".
-- check that the audio files are uploaded correctly (say listen to them or check their sizes).
-- check that your folder name is exactly WhisperJAV (case sensitive).
If none of the above, send me a screenshot of the last screen output. I'll check it.
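The file-related checks in that list can be run locally before uploading. This is a minimal sketch, not part of the notebook: `check_upload_folder` is a hypothetical helper, and matching extensions exactly as listed (lowercase only) is my assumption about how strict the notebook is.

```python
from pathlib import Path

# Extensions quoted in the checklist above.
ACCEPTED_EXTS = {".mp3", ".wav", ".aac", ".m4a", ".ogg", ".flac"}


def check_upload_folder(folder: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means nothing was found."""
    problems = []
    if folder.name != "WhisperJAV":  # the folder name is case sensitive
        problems.append(f"folder is named {folder.name!r}, expected 'WhisperJAV'")
    files = [f for f in folder.iterdir() if f.is_file()] if folder.is_dir() else []
    if not files:
        problems.append("no files found in the folder")
    for f in files:
        if f.suffix not in ACCEPTED_EXTS:
            problems.append(f"{f.name}: extension {f.suffix!r} is not in the accepted list")
        elif f.stat().st_size == 0:
            problems.append(f"{f.name}: file is empty, the upload may have failed")
    return problems
```

It cannot detect a crashed Colab session or an outdated notebook, so those two items still need a manual look.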
What's your plan for updating the existing Colab notebooks, especially the WhisperwithVADpro? I've been using it since its inception and it works wonders.
Can you share the Google Colab link for the WhisperwithVADpro? I'm assuming it's a better version of this?
Google Colab
colab.research.google.com
This is the link for the WhisperwithVADpro. There are a bunch of new settings that can be tweaked to get better results. For default settings, you may refer to or use the notes at the bottom of it. Keep in mind that it may take longer to transcribe the audio.