Too bad I can't get the ctranslate2 version to work with my GPU. I guess I could check how long it actually takes on CPU, but it's probably way too slow.
The ctranslate2 version uses faster-whisper under the hood and adds a few extra features: coloured subtitles, live transcription from the microphone, and loading models from a local folder. None of these features matter for the quality of the subs.
You can just use faster-whisper directly. Here are two good options for faster-whisper (there's a minimal usage sketch after the links):
Purfview/whisper-standalone-win: Whisper & Faster-Whisper standalone executables for those who don't want to bother with Python. https://github.com/Purfview/whisper-standalone-win
SYSTRAN/faster-whisper: Faster Whisper transcription with CTranslate2. https://github.com/SYSTRAN/faster-whisper
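If you go the direct faster-whisper route, usage looks roughly like this; a minimal sketch assuming an NVIDIA GPU (the file name is illustrative, and large-v2 is the general-purpose pick per the model notes below):

```python
from faster_whisper import WhisperModel

# large-v2 as the general-purpose model; device="cuda" with
# float16 assumes an NVIDIA GPU with enough VRAM.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# language="ja" skips auto-detection; "episode.mkv" is a placeholder,
# faster-whisper decodes audio out of video containers via PyAV.
segments, info = model.transcribe("episode.mkv", language="ja", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```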
They both run nicely on Windows and are quite fast with NVIDIA GPUs. The first one uses some proprietary techniques that the author has not disclosed, and it does a good job.
In terms of models (for Japanese), in my view:
large-v3: do NOT use. Wait for the next update.
large-v2: use this one for general purpose.
large-v1: use this one for cross-talk and multiple speakers
Also note that almost all implementations use Silero VAD for preprocessing before feeding the audio chunks to the model. Silero VAD is designed by default to ignore background speakers/voices, and it also doesn't handle long stretches of total silence very well. The author is working on a new version (v5) which is supposed to fix the current deficiencies.
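In faster-whisper the VAD pass is exposed through transcribe()'s vad_filter and vad_parameters arguments, so you can tune it or switch it off if it's eating lines; a sketch (the parameter value here is illustrative, not a recommendation):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# vad_filter=True runs Silero VAD first and drops non-speech chunks.
# vad_parameters tweaks its behaviour, e.g. requiring a longer pause
# before a cut. Set vad_filter=False to bypass VAD entirely.
segments, info = model.transcribe(
    "episode.mkv",
    language="ja",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=1000),
)
```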