Hi all,
Nice to have found this thread. I've been using an app called Buzz for this workflow.
https://github.com/chidiwilliams/buzz
If you scroll down, you can see links to the installers for various platforms.
Once installed, you just drag and drop the video file onto its window, and you get a popup of options to select:
- Model: Whisper, Whisper.cpp, Hugging Face, Faster Whisper or OpenAI
- Whisper model sizes include: Tiny, Base, Small, Medium, Large, LargeV2, LargeV3 and LargeV3-Turbo
- Task: Translate or Transcribe
- Language: lots of languages here, including Japanese and Javanese
- Advanced options: the speech recognition temperature, prompts, and enabling AI translation (model and instructions)
It will then export as TXT, SRT or VTT, with or without word-level timings, into the folder where the source video or audio file is. Processing time obviously depends on the length of the movie, but from my history I can see runs of 31m, 23m, 13m, 9m, 11m, 14m, 34m and 15m, to mention a few examples.
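For anyone who would rather script the same workflow instead of using the GUI, I believe Buzz's "Faster Whisper" option is the faster-whisper Python package, so something like the rough sketch below should behave similarly. This is just my assumption of how it maps to the options above; the file names are placeholders and I haven't checked Buzz's exact defaults for temperature or prompts.

```python
# Rough sketch: transcribe a video and write an SRT next to it using
# faster-whisper (pip install faster-whisper). File names are placeholders.
from pathlib import Path
from faster_whisper import WhisperModel

source = Path("movie.mp4")
model = WhisperModel("large-v2")  # same size names as in Buzz's model list

segments, info = model.transcribe(
    str(source),
    task="transcribe",      # or "translate"
    language="ja",          # leave as None to auto-detect
    temperature=0.0,        # the "speech recognition temperature" option
    initial_prompt=None,    # the "prompt" option
    word_timestamps=False,  # set True for word-level timings
)

def srt_time(t: float) -> str:
    # Format seconds as an SRT timestamp, e.g. 00:01:02,345
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02}:{int(m):02}:{int(s):02},{int((s % 1) * 1000):03}"

with source.with_suffix(".srt").open("w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n\n")
```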
It sounds like the results are in line with what has been reported here: hit and miss sometimes, duplication, periods of nothing (which I assume is down to the quality of the source material). I've tried extracting the audio and running it through an AI to strip out everything but the vocals (I used DJ Studio for this), but it didn't seem to affect the results much.
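If anyone wants to do the audio-extraction step without a separate app, something like this Python call to ffmpeg should work (assuming ffmpeg is installed and on the PATH; file names are placeholders):

```python
import subprocess

# Pull just the audio track out of the video as 16 kHz mono WAV, which is the
# format Whisper models resample to anyway. "movie.mp4"/"movie.wav" are placeholders.
subprocess.run(
    ["ffmpeg", "-i", "movie.mp4", "-vn", "-ac", "1", "-ar", "16000", "movie.wav"],
    check=True,
)
```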
Also, I'm thinking there may be some NSFW filtering going on, as the text sometimes says "sorry" or "not sure I can do this"... I don't think it's the scenes every time!! As it's on GitHub, I may take a look at the source and see if that can be turned off.
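Before digging into the source, it may be worth saying that those stray "sorry" type lines could just be a known Whisper quirk: the model tends to hallucinate filler phrases over silence or music, so it might not be a content filter at all. If you run the transcription through faster-whisper directly (as in the sketch above), the options below are the usual ones people try to cut down on duplication and phantom lines; I haven't checked whether Buzz exposes them in its UI.

```python
# Same model.transcribe() call as above, with options that tend to reduce
# repeated lines and hallucinated text over silence. These are my own
# assumptions, not verified against Buzz's settings.
segments, info = model.transcribe(
    "movie.wav",
    condition_on_previous_text=False,  # stops one bad segment poisoning the next
    vad_filter=True,                   # skip long silent stretches entirely
)
```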
So, as with the current approach, the end file still requires some work. However, the results are good enough for me to understand what is being said. From a few bits of testing, I think the LargeV2 model gives the best results.
I'm not affiliated in any way. Just thought I'd mention it in case it makes life easier.
Cheers.