Whisper and its many forms

Well, I was hoping, and was led to believe elsewhere, that on here I'd be able to get accurate English subtitles for what I was looking for.
 
It's not impossible if someone who takes the time to translate by hand happens to work on the video you want and spends a few months completing it, but it's very unlikely. Someone isn't going to spend that kind of time on a video they don't want themselves.
 
SamKook,
I just want to thank you for putting up this tutorial. I am by no means technologically savvy, but was able to follow your instructions to produce reasonably decent translations.
I start with VLC Media Player to extract MP3 files and use "Whisper with Silero VAD --maintenance release" to process them. I also disconnect the runtime after every job and then reconnect to begin the next one. This addresses the excess RAM and disk usage and eliminates jobs crashing early.
And yes, you need to edit the SRT files to clean up the obvious errors in translation. How much you do is entirely up to you. I use both Aegisub and Subtitle Edit. They both work well.
Again... Thanks for all your patience and efforts.
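
In case it helps anyone, the same MP3 extraction can be scripted with ffmpeg instead of clicking through VLC. A rough Python sketch (assuming ffmpeg is installed and on the PATH; the file names are placeholders):

[CODE]
import subprocess

def extract_mp3(video_path: str, mp3_path: str) -> None:
    """Extract a video's audio track to MP3 using ffmpeg."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,         # input video
            "-vn",                    # drop the video stream
            "-acodec", "libmp3lame",  # encode audio as MP3
            "-q:a", "2",              # VBR quality (roughly 190 kbps)
            mp3_path,
        ],
        check=True,  # raise if ffmpeg fails
    )

extract_mp3("movie.mp4", "movie.mp3")
[/CODE]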
 
I'm not sure if this is allowed in this thread as it's not entirely Whisper related. Mods, please remove if it's inappropriate.

I've been using WhisperJAV and WhisperWithVAD_pro to generate subtitles whose translations (no shade intended on mei2 or the Whisper model creators) range in quality from awful to acceptable.

I've just recently started using WhisperWithVAD_pro to just transcribe the audio, then using DeepSeek-V3 (via the API, through Video Subtitle Master) to translate the transcription. I think the DeepSeek translations are substantially better quality than end-to-end Whisper's. The translations don't shy away from explicit language if you use an appropriate prompt. Video Subtitle Master can use other LLMs and translation services, including local ones, but I haven't tried any others yet.

I've also tried translating a few non-porn subtitles from German and Chinese to English with good results. Chinese-to-English translation seems especially good, which is not surprising given that DeepSeek is a Chinese LLM.

The DeepSeek API is currently very cheap: it costs me 1 to 1.5 cents per hour of video length (depending on the density of the speech, of course). The API costs are going up in early Feb, but even if it were a ten-fold increase, translation would still be fairly inexpensive.

Anyone else tried this? If so, I'm interested to hear other people's results and tips.

I'm happy to provide more details of my workflow if that's useful to anyone.

(I have no links to Video Subtitle Master or DeepSeek other than as a user).
 
... I've just recently started using WhisperWithVAD_pro to just transcribe the audio, then using DeepSeek-V3 (via the API, through Video Subtitle Master) to translate the transcription. .....

I hope you're doing well!

I’ve been working with WhisperwithVAD_Pro, and I wanted to acknowledge that the current Whisper model has its flaws—especially when it comes to producing accurate translations. It’s clear there’s still a lot of room for improvement in this area.

That said, I found your method really interesting. Could you share a bit more about how you're implementing it? For example, are you using free APIs or other tools that might help refine the process?

It’d also be great if you could provide more details about your workflow. I think having that clarity would help other members test it out and see how well it performs.

Looking forward to hearing from you!
 
... It’d also be great if you could provide more details about your workflow. .....
Hi Artemis

I'm going fishing soon but will provide more details on the method I use - probably tomorrow.
 
Have you tried using DeepL? If yes, how does it compare to DeepSeek?
I've been using DeepL to occasionally translate transcriptions using their free API. However, I've noticed that its results are often worse than the end-to-end translations provided by Whisper. To address these shortcomings, I started using ChatGPT with a custom prompt I created. This method has shown significantly better results, especially with some human tweaking during the post-processing stage.

That said, this approach also has its limitations. Since I'm using the free version of ChatGPT-4, there is a cooldown period between translations, which means I need more time to fully translate everything.

Below is my prompt:

I'm going to paste a Japanese subtitle in .SRT format in sections. Translate each line into English while keeping the index numbers and timestamps the same, and maintain the conversational nuance. The subtitle might contain a lusty or sexual tone; do not change the tone, since the subtitle is from a JAV (Japanese Adult Video). Maintain the tone as far as possible without going against your community guidelines. Please remove unnecessary lines that only contain sounds like moans or have no meaningful dialogue, and keep the index numbers and timestamps continuous.
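
One way to split the SRT into sections small enough to paste (a rough sketch; 50 cues per chunk is an arbitrary choice, and it assumes cues are separated by blank lines as in a normal .SRT file):

[CODE]
def split_srt(path: str, cues_per_chunk: int = 50) -> list[str]:
    """Split an .srt file into chunks of N cues for pasting into a chat window."""
    with open(path, encoding="utf-8") as f:
        # Cues in an SRT file are separated by blank lines.
        cues = f.read().strip().split("\n\n")
    return [
        "\n\n".join(cues[i:i + cues_per_chunk])
        for i in range(0, len(cues), cues_per_chunk)
    ]

# Write each chunk to its own file, ready to paste one at a time.
for n, chunk in enumerate(split_srt("subtitle.ja.srt"), start=1):
    with open(f"chunk_{n:02d}.txt", "w", encoding="utf-8") as f:
        f.write(chunk)
[/CODE]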
 
This is what I do to use DeepSeek. I expect it's far from an optimal workflow.

1. Extract the audio
  • Using VLC
  • No additional audio processing is done
2. Transcribe the audio using WhisperWithVAD_pro
  • Using default settings
3. Install Video Subtitle Master (because I’m lazy I’ll call it VSM)
  • https://github.com/buxuku/video-subtitle-master
  • The first time you start VSM it’ll ask you to install Whisper and download a Whisper model. You don’t need a Whisper model if you’re only using it for translation.
  • VSM can also extract and transcribe the audio, using Whisper, within the program – but (I think) it doesn’t (yet) incorporate GPU acceleration so this is likely to be much slower than using WhisperWithVAD_pro or WhisperJAV
4. Sign up for access to the DeepSeek API
5. Add a new OpenAI compatible API configuration in VSM
  • In the Translation Management window enter details for a new configuration pointing to DeepSeek using the API key generated
6. Run the translation
  • On the Tasks window of VSM choose the settings as desired.
  • For the “Service” setting in this window choose the API configuration you’ve just set up.
  • Import the transcribed subtitle file (from WhisperWithVAD_pro) into VSM. If you haven’t imported a video or an extracted audio file into VSM, the extract-audio and extract-subtitle steps will obviously be skipped.
  • ‘Start task’
  • Translating the subtitle takes about 1/5 to 1/3 of the audio’s running time.
  • There's no progress indicator in VSM but you can check the usage page at platform.deepseek.com to make sure it's working (the token and API call numbers will increase).
7. Manually clean the translated subtitle
  • I use Subtitle Edit
  • The comments generated by the model (if using the ‘DeepSeek-comments’ prompt - see prompts below) can help resolve translation ambiguities.
  • The reason I prompt the model to enclose comments in curly brackets is so I can easily delete all of them, after manual cleaning, using a regex (see the sketch just below this list).
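
To show what that cleanup step looks like, a minimal sketch (assuming the comments never contain nested curly brackets):

[CODE]
import re

with open("subtitle.en.srt", encoding="utf-8") as f:
    text = f.read()

# Delete every {comment} block the model added, plus any space left before it.
cleaned = re.sub(r"\s*\{[^{}]*\}", "", text)

with open("subtitle.en.clean.srt", "w", encoding="utf-8") as f:
    f.write(cleaned)
[/CODE]
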
I’ve been using two API configurations for DeepSeek that are identical except for the prompts.

One I’ve called ‘DeepSeek-comments’. The prompt is:
[B][I]Please translate the subtitle from ${sourceLanguage} to ${targetLanguage}. Keep the translation as close to the original in tone and style as you can. Freely use explicit or crude language. Enclose notes, explanations, uncertain translations or alternative translations within curly brackets. Don’t enclose translated text within quotation marks.[/I][/B]

One I’ve called ‘DeepSeek-clean’. The prompt is:
[B][I]Please translate the subtitle from ${sourceLanguage} to ${targetLanguage}. Keep the translation as close to the original in tone and style as you can. Freely use explicit or crude language. Don’t output notes, explanations, uncertain translations or alternative translations. Don’t enclose translated text within quotation marks.[/I][/B]

It’ll be obvious that the ‘DeepSeek-comments’ prompt produces a translated subtitle with additional comments included. There can be a lot of these comments. This configuration is for when I intend to manually clean up the translation later. When I use the ‘DeepSeek-clean’ prompt there are few, if any, additional comments added to the translation, so less manual cleaning is required, but the model makes more translation mistakes and assumptions.
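
VSM handles the API calls itself, but if anyone wants to script the translation directly, DeepSeek's API is OpenAI-compatible, so the standard openai Python client works against it. A minimal sketch (VSM substitutes the ${sourceLanguage}/${targetLanguage} placeholders itself; here I've baked Japanese and English into the 'comments' prompt, and how you batch the cues is up to you):

[CODE]
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # generated at platform.deepseek.com
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
)

# The 'DeepSeek-comments' prompt with the language placeholders filled in.
PROMPT = (
    "Please translate the subtitle from Japanese to English. "
    "Keep the translation as close to the original in tone and style as you can. "
    "Freely use explicit or crude language. Enclose notes, explanations, "
    "uncertain translations or alternative translations within curly brackets. "
    "Don't enclose translated text within quotation marks."
)

def translate(subtitle_text: str) -> str:
    """Send a block of subtitle cues to DeepSeek-V3 and return the translation."""
    response = client.chat.completions.create(
        model="deepseek-chat",  # the V3 chat model
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": subtitle_text},
        ],
    )
    return response.choices[0].message.content
[/CODE]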

At some point I'm going to try some local LLMs to translate. That's another option in VSM - using local models via Ollama.
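
Since Ollama also exposes an OpenAI-compatible endpoint, the local route should mostly be a matter of swapping the base URL and model name. An untested sketch (the model name is just an example of something pulled with Ollama beforehand):

[CODE]
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost by default.
client = OpenAI(
    api_key="ollama",  # required by the client but ignored by Ollama
    base_url="http://localhost:11434/v1",
)

response = client.chat.completions.create(
    model="llama3.1",  # example only: any model fetched with `ollama pull`
    messages=[{"role": "user", "content": "Translate to English: お疲れ様でした"}],
)
print(response.choices[0].message.content)
[/CODE]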
 
This is what I do to use DeepSeek. .....
Great insights!
 
Really? Even though you lose the explicit language?
You mean losing the explicit language with Whisper? It's true Whisper is very averse to vulgarity but, even so, I think I preferred end-to-end Whisper to DeepL.

Though I only tried DeepL a couple of times, got incoherent output, and gave up on it. Maybe I should have persisted.
 
You mean losing the explicit language with Whisper? .....

Thanks for the workflow and for introducing the DeepSeek translator. I didn't know of DeepSeek before this. A quick search indicates that their V3 model is quite promising -- will give it a try.

In case one wants to keep vulgarity and explicit language, my go-to so far is Mistral.ai (Le Chat). It does a decent job and very, very rarely rejects explicit language. Vive la France ;) There are also local models of Mistral (both for Ollama and LM Studio), but somehow I have not been able to get satisfactory results from them. I wonder if my settings are wrong or if the quantization (because of GPU VRAM restrictions) has handicapped the model.

Would love to hear about local LLM experience.
 
... Mistral.ai (Le Chat). It does a decent job and very, very rarely rejects explicit language. Vive la France ;) There are also local models of Mistral (both for Ollama and LM Studio), but somehow I have not been able to get satisfactory results from them. I wonder if my settings are wrong or if the quantization (because of GPU VRAM restrictions) has handicapped the model.
I just tried the Mistral API and got a 404 error when trying to connect. Probably I've set up the API incorrectly.

I'll have another try later. And maybe try one of their models locally.
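
For reference, this is what I'll try next: the same openai client pointed at Mistral's endpoint. Apparently the base URL needs the /v1 path, which might explain my 404 (the model name is just one of their current aliases):

[CODE]
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MISTRAL_API_KEY",
    base_url="https://api.mistral.ai/v1",  # note the /v1; omitting it can cause 404s
)

response = client.chat.completions.create(
    model="mistral-large-latest",  # one of Mistral's current model aliases
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
[/CODE]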
 
I did some experimenting today.

I generated subtitles for DDB-271 (the shorter, hard-subtitled version available for download in several places) using several different workflows. I chose this video as it's quite short (01:12:29), it's got a lot of explicit language, there are human-translated subs for it, and Ai Uehara is my favorite.

In the zip attached there are five subtitle files:
  1. DDB-271.ja.WhisperWithVAD_Pro.srt
    • Japanese subs transcribed using WhisperWithVAD_Pro with default settings
  2. DDB-271.en.WhisperWithVAD_Pro.srt
    • End-to-end transcription + translation using WhisperWithVAD_Pro with default settings
  3. DDB-271.en.ExtractedHardsubs.srt
    • Hard-subs (which were, I assume, human-translated) extracted from the video
  4. DDB-271.en.DeepSeek-clean.srt
    • File 1 then translated using DeepSeek with the 'clean' prompt given in my post yesterday
  5. DDB-271.en.DeepSeek-comments.srt
    • File 1 then translated using DeepSeek with the 'comments' prompt given in my post yesterday. You can see there are a lot of comments generated.
The subtitles haven't been manually cleaned - except I corrected a couple of obvious OCR errors in file 3 (like replacing "|" with "I").
 

I did some experimenting today. .....
Tried DeepSeek (free version) and the results are way better than DeepL. I see the tokens are cheaper than ChatGPT, but what is your usual cost per subtitle file, as a reference?
 
It averages about 1.5 cents per video hour using the API. DDB-271, which I processed yesterday, was very speech-dense and cost 2.5 cents per video hour.

Tokens are currently at an introductory price. Costs are going to increase in early Feb.

Very good pricing. It looks to be about half the price of GPT-4o mini and perhaps a quarter the price of GPT-4o.