I'm not sure if this is allowed in this thread as it's not entirely Whisper related. Mods, please remove if it's inappropriate.
I've been using WhisperJAV and WhisperWithVAD_pro to generate subtitles whose translations (no shade intended towards mei2 or the Whisper model creators) range in quality from awful to acceptable.
I've just recently started using WhisperWithVAD_pro to transcribe the audio only, then using DeepSeek-V3 (via the API, through Video Subtitle Master) to translate the transcription. I think the DeepSeek translations are substantially better quality than end-to-end Whisper. The translations don't shy away from explicit language if you use an appropriate prompt. Video Subtitle Master can use other LLMs and translation services, including local ones, but I haven't tried any others yet.
I've also tried translating a few non-porn subtitles from German and Chinese to English with good results. Chinese to English translation seems especially good which is not surprising with DeepSeek being a Chinese LLM.
The DeepSeek API is currently very cheap - it costs me 1 to 1.5 cents per hour of video length (depending on the density of the speech, of course). The API costs are going up in early Feb, but even if it were a ten-fold increase, translation would still be fairly inexpensive.
Anyone else tried this? If so, I'm interested to hear other people's results and tips.
I'm happy to provide more details of my workflow if that's useful to anyone.
(I have no links to Video Subtitle Master or DeepSeek other than as a user).
Hi Artemis, I hope you're doing well!
I’ve been working with WhisperwithVAD_Pro, and I wanted to acknowledge that the current Whisper model has its flaws—especially when it comes to producing accurate translations. It’s clear there’s still a lot of room for improvement in this area.
That said, I found your method really interesting. Could you share a bit more about how you're implementing it? For example, are you using free APIs or other tools that might help refine the process?
It’d also be great if you could provide more details about your workflow. I think having that clarity would help other members test it out and see how well it performs.
Looking forward to hearing from you!
I've been using DeepL to occasionally translate transcriptions via their free API. However, I've noticed that its results are often worse than the end-to-end translations provided by Whisper. To address these shortcomings, I started using ChatGPT with a custom prompt I created. This method has shown significantly better results, especially with some human tweaking during the post-processing stage.
Have you tried using DeepL? If yes, how does it compare to DeepSeek?
Great insights!
This is what I do to use DeepSeek. I expect it's a far from optimal workflow.
1. Extract the audio
- Using VLC
- No additional audio processing is done
2. Transcribe the audio using WhisperWithVAD_pro
- Using default settings
3. Install Video Subtitle Master (because I'm lazy, I'll call it VSM)
- https://github.com/buxuku/video-subtitle-master
- The first time you start VSM it'll ask you to install Whisper and download a Whisper model. You don't need a Whisper model if you're only using it for translation.
- VSM can also extract and transcribe the audio, using Whisper, within the program - but (I think) it doesn't (yet) incorporate GPU acceleration, so this is likely to be much slower than using WhisperWithVAD_pro or WhisperJAV
4. Sign up for access to the DeepSeek API
- At platform.deepseek.com
- Pay for some tokens.
- Generate an API key.
5. Add a new OpenAI-compatible API configuration in VSM
- In the Translation Management window, enter the details for a new configuration pointing to DeepSeek, using the API key you generated.
6. Run the translation
- On the Tasks window of VSM, choose the settings as desired.
- For the "Service" setting in this window, choose the API configuration you've just set up.
- Import the transcribed subtitle file (from WhisperWithVAD_pro) into VSM. If you haven't imported a video or an extracted audio file into VSM, the extract audio and extract subtitle steps will obviously be skipped.
- 'Start task'
- Translating the subtitle takes about 1/5 to 1/3 of the audio length.
- There's no progress indicator in VSM, but you can check the usage page at platform.deepseek.com to make sure it's working (the token and API call numbers will increase).
7. Manually clean the translated subtitle
- I use Subtitle Edit for this.
- The comments generated by the model (if using the 'DeepSeek-comments' prompt - see prompts below) can help resolve translation ambiguities.
- The reason I prompt to enclose comments in curly brackets is so I can easily delete all these comments, after manual cleaning, using a regex expression.
I've been using two API configurations for DeepSeek that are the same except for the prompts.
One I’ve called ‘DeepSeek-comments’. The prompt is:
[B][I]Please translate the subtitle from ${sourceLanguage} to ${targetLanguage}. Keep the translation as close to the original in tone and style as you can. Freely use explicit or crude language. Enclose notes, explanations, uncertain translations or alternative translations within curly brackets. Don’t enclose translated text within quotation marks.[/I][/B]
One I’ve called ‘DeepSeek-clean’. The prompt is:
[B][I]Please translate the subtitle from ${sourceLanguage} to ${targetLanguage}. Keep the translation as close to the original in tone and style as you can. Freely use explicit or crude language. Don’t output notes, explanations, uncertain translations or alternative translations. Don’t enclose translated text within quotation marks.[/I][/B]
It’ll be obvious that the ‘DeepSeek-comments’ prompt will provide a translated subtitle with additional comments included. There can be a lot of these comments. This configuration is for when I intend to manually clean up the translation later. When I use the ‘DeepSeek-clean’ prompt there are few, if any, additional comments added to the translation. So less manual cleaning is required but the model will make more translation mistakes and assumptions.
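If anyone wants to skip VSM and call the API directly, the translation step boils down to an OpenAI-style chat completion request with one of the prompts above as the system message (with the languages filled in). This is only a rough sketch - it assumes DeepSeek's documented base URL and the deepseek-chat model name, and the helper functions are just illustrative - but it shows the idea, including the kind of regex I mean for stripping the curly-bracket comments afterwards:
[CODE]
# Rough sketch only, not my exact workflow. Assumes the openai Python package,
# DeepSeek's OpenAI-compatible endpoint and the "deepseek-chat" model name -
# check the DeepSeek docs for the current values before relying on this.
import re
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

# The 'DeepSeek-comments' prompt from above, with the languages filled in.
PROMPT = ("Please translate the subtitle from Japanese to English. "
          "Keep the translation as close to the original in tone and style as you can. "
          "Freely use explicit or crude language. "
          "Enclose notes, explanations, uncertain translations or alternative "
          "translations within curly brackets. "
          "Don't enclose translated text within quotation marks.")

def translate(subtitle_text: str) -> str:
    # One chat-completion call; VSM presumably does something similar for each batch of lines.
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": subtitle_text},
        ],
    )
    return response.choices[0].message.content

def strip_comments(translated_text: str) -> str:
    # Delete the {curly-bracket} notes once the manual clean-up is done.
    return re.sub(r"\{[^}]*\}", "", translated_text).strip()
[/CODE]
Translating one line at a time like this loses context between lines, so sending a block of subtitle lines per request (which I assume is roughly what VSM does) should give better results.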
At some point I'm going to try some local LLMs to translate. That's another option in VSM - using local models via Ollama.
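If anyone tries the local route before I do: Ollama exposes a simple HTTP API on localhost:11434, so a minimal sketch would look something like the following (the "mistral" tag is only an example, not a recommendation; use whatever instruction-tuned model you've pulled):
[CODE]
# Minimal sketch, untested as part of my workflow. Assumes Ollama is running
# locally with some instruction-tuned model already pulled ("mistral" below
# is only an example tag).
import requests

# The 'DeepSeek-clean' prompt from above, with the languages filled in.
PROMPT = ("Please translate the subtitle from Japanese to English. "
          "Keep the translation as close to the original in tone and style as you can. "
          "Freely use explicit or crude language. "
          "Don't output notes, explanations, uncertain translations or alternative translations. "
          "Don't enclose translated text within quotation marks.\n\nSubtitle: ")

def translate_locally(subtitle_text: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": PROMPT + subtitle_text, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    # With streaming off, Ollama returns the completed text in the "response" field.
    return r.json()["response"].strip()
[/CODE]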
Have you tried using DeepL? If yes, how does it compare to DeepSeek?
I have once or twice. Like Artemis, I think end-to-end Whisper does a better job. And DeepL is much inferior to DeepSeek, IMO.
Really? Even though you lose the explicit language?
You mean losing the explicit language with Whisper? It's true Whisper is very averse to vulgarity but, even so, I think I preferred end-to-end Whisper to DeepL.
... Mistral.ai (LeChat). It does a decent job and very, very rarely rejects explicit language. Viva France! There are also local models of Mistral (both for ollama and studio), but somehow I have not been able to get satisfactory results from them. I wonder if my settings are wrong or if the quantization (because of GPU VRAM restrictions) has handicapped the model.
I just tried the Mistral API and get a 404 error when trying to connect. Probably I've set up the API incorrectly.
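In case it helps anyone else debug the same thing: a 404 from an OpenAI-compatible endpoint is usually the base URL being wrong (e.g. a missing or doubled /v1 path segment) rather than a problem with the key. A quick way to test the endpoint outside VSM, assuming Mistral's documented chat completions URL and a current model alias, is something like:
[CODE]
# Quick endpoint check outside VSM. The URL and model name below are what I
# understand Mistral's docs to say - if they're wrong, that's the assumption
# to fix first, because a 404 here points at the URL rather than the API key.
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_MISTRAL_API_KEY"},
    json={
        "model": "mistral-small-latest",
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=60,
)
print(resp.status_code, resp.text[:300])
[/CODE]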
Tried DeepSeek (free version) and the results are way better than DeepL. I see the tokens are cheaper than ChatGPT, but what is your usual cost per subtitle file as a reference?
I did some experimenting on this today.
I generated subtitles for DDB-271 (the shorter, hard-subtitled version available for download in several places) using several different workflows. I chose this video as it's quite short (01:12:29), it's got a lot of explicit language, there are human-translated subs for it and Ai Uehara is my favorite.
In the zip attached there are five subtitle files. They haven't been manually cleaned, except that I corrected a couple of obvious OCR errors in file 3 (like replacing "|" with "I").
1. DDB-271.ja.WhisperWithVAD_Pro.srt - Japanese subs transcribed using WhisperWithVAD_Pro with default settings
2. DDB-271.en.WhisperWithVAD_Pro.srt - End-to-end transcription + translation using WhisperWithVAD_Pro with default settings
3. DDB-271.en.ExtractedHardsubs.srt - Hard-subs (which were, I assume, human-translated) extracted from the video
4. DDB-271.en.DeepSeek-clean.srt - File 1 then translated using DeepSeek with the 'clean' prompt given in my post yesterday
5. DDB-271.en.DeepSeek-comments.srt - File 1 then translated using DeepSeek with the 'comments' prompt given in my post yesterday. You can see there are a lot of comments generated.
... what is your usual cost per subtitle file as a reference?
Averages about 1.5 cents per video hour using the API. DDB-271, which I processed yesterday, was very speech-dense and cost 2.5 cents per video hour.
Tokens are currently at an introductory price. Costs are going to increase in early Feb.