Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

So should I switch to whisper-ctranslate2 then? My computer has no issues running the normal whisper with the largest models, so I never tried any faster whisper versions before.

Edit: Installed it and will give it a go.
 
Well, I ran it. Not sure what went wrong, but instead of translating my audio the only result I got was "Detect language 'Japanese' with probability 1.000000"

No subtitle files were made; it only told me the movie is in Japanese, which I already knew.
 
Good choice to go with whisper-ctranslate2. Especially now that the original developer of faster-whisper has left to join Apple, I've been wondering which branch that is still actively developed would be a good one to move to. I'm guessing that the contributors to whisper-ctranslate2 will continue to maintain the new (systran) branch of faster-whisper. I'll give whisper-ctranslate2 a try.

On a pedantic :) note: the option for repetition_penalty is a feature of faster-whisper, not present in original Whisper.

Thanks, I don't pay enough attention to note which features are from whisper, faster whisper, ctranslate2, or whisper-ctranslate2. It all blurs together.

Unfortunately, there are some Windows bugs in faster-whisper or ctranslate2 that have never really been fixed.
 
After doing some testing with different settings, I have found that the program works using my CPU but fails to produce any output when I select CUDA. Since using the CPU is a lot slower, I guess I'll have to go back to regular whisper.
 
I have a question: which gives better results, this or Subtitle Edit?

The short answer: I prefer WhisperJAV because it post-processes the results of Whisper to remove hallucination and repetition.

The long answer: it depends. SE provides many variants of Whisper and they are all very good:

  1. The original OpenAI Whisper (requires Python)
  2. Purfview's Faster-Whisper (Windows only)
  3. A simplified/optimized version called whisper.cpp (written in C++)
  4. A GPU optimized version of CPP with cuBLAS support (written in C++)
  5. A GPU optimized version called Whisper Const-me (written in C++)
  6. An optimized version called Whisper CTranslate2 (also known as FasterWhisper) (requires Python)
  7. Another Python version called stable-ts (requires Python)
  8. Another Python version called whisperX (requires Python)

WhisperJAV uses option 6 in this list: FasterWhisper. One can produce better results with SE by creating multiple subs and joining them. For example, run options (2) and (4), remove the hallucinations, then merge the 2 srts together. Options 2 and 4 are both very fast and very good implementations.
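The "merge 2 srts" step above can be sketched in Python roughly as follows. This is only an illustration, not how SE or WhisperJAV does it: the names `Cue`, `parse_srt` and `merge_cues` are made up for this example, and the merge rule (keep every cue from the first file, add cues from the second file only where they do not overlap in time) is one assumed strategy among many.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    h, m, s, ms = TIME_RE.match(stamp.strip()).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                text = "\n".join(lines[i + 1:])
                cues.append(Cue(to_seconds(start), to_seconds(end), text))
                break
    return cues


def merge_cues(primary: list[Cue], secondary: list[Cue]) -> list[Cue]:
    """Keep all primary cues; add secondary cues that overlap none of them."""
    merged = list(primary)
    for cue in secondary:
        overlaps = any(cue.start < p.end and p.start < cue.end for p in primary)
        if not overlaps:
            merged.append(cue)
    return sorted(merged, key=lambda c: c.start)
```

The idea is that the second run only fills the gaps the first run missed, so you don't end up with two competing lines on screen at the same time.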
 
Can you tell me what vad_threshold and chunk_duration do, and which model is good to use?
 
And what is hallucination?

Is it random words?
Words inferred when there are no words. Oftentimes in JAV it will be sequences of sexual noises being interpreted as some really repetitive speech, but other times it can be some crazy interpretations.

The various thresholds in the voice detection are the main control, but it is far from perfect.
 
Hallucinations can get pretty weird. Yesterday I got some moaning during sex translated as "Throw a long and thin piece of paper on the floor."
 
Haha, hadn't seen that one before :) I'm guessing you used the large-v3 model? That one produces some weird stuff.
 
Can you tell me what vad_threshold and chunk_duration do, and which model is good to use?
VAD is voice activity detection. Threshold is the sensitivity, and chunk is the maximum length of a detected voice segment. See here for details: SileroVAD
If your video has a lot of music and noise, increase the VAD threshold.
If you want to have longer subtitle lines, increase the VAD chunk.
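To make those two knobs a bit more concrete, here is a toy Python sketch. It is NOT SileroVAD's actual algorithm; it only assumes the VAD gives a speech probability per audio frame. The function name `speech_segments` and the parameters `frame_s` and `max_chunk_s` are hypothetical, picked for this example.

```python
def speech_segments(probs, frame_s=0.5, threshold=0.5, max_chunk_s=2.0):
    """Group per-frame speech probabilities into (start, end) segments in seconds.

    A frame counts as speech when its probability is >= threshold.
    A run of speech frames is split once it would exceed max_chunk_s.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        t = i * frame_s
        if p >= threshold:
            if start is None:
                start = t  # speech begins
            elif t + frame_s - start > max_chunk_s:
                # Adding this frame would exceed the cap: close and restart.
                segments.append((start, t))
                start = t
        elif start is not None:
            segments.append((start, t))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(probs) * frame_s))
    return segments


probs = [0.1, 0.6, 0.9, 0.9, 0.9, 0.9, 0.9, 0.4, 0.7, 0.2]
print(speech_segments(probs, threshold=0.5, max_chunk_s=2.0))
print(speech_segments(probs, threshold=0.8, max_chunk_s=10.0))
```

With the higher threshold, the weaker frames (0.6 and 0.7, think music or noise) are no longer treated as speech. With the larger chunk cap, the long run of speech stays in one piece instead of being split, which is what gives you longer subtitle lines.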
 
On the subject of hallucination: I am collecting all occurrences of hallucination text. I add those to the filter in WhisperJAV, but the collection can be used for any other filter too. Please post here or send me any new hallucination texts that you come across. At the moment all the filtered texts that I have collected from this forum are inside the code of WhisperJAV, but I am planning to move them to a separate text file so they can be reused.

Credits: some time ago a very comprehensive list of hallucination text was posted by one of the members here. I have used that directly in WhisperJAV. I don't remember the name of the member who had put that together (and can't seem to find the original post either). Kudos and thank you to that member!
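For anyone who wants to reuse such a list in their own filter, a minimal Python sketch could look like this. It is not the WhisperJAV code. It assumes a plain text file with one hallucination phrase per line (lines starting with `#` are treated as comments), and the names `load_phrases`, `filter_lines` and `max_repeats` are invented for the example.

```python
import re
from pathlib import Path


def normalize(text):
    """Lowercase and collapse whitespace so near-identical lines compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def load_phrases(path):
    """Read one phrase per line; skip blank lines and '#' comment lines."""
    phrases = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip() and not line.lstrip().startswith("#"):
            phrases.add(normalize(line))
    return phrases


def filter_lines(lines, phrases, max_repeats=2):
    """Drop known hallucination phrases and cap runs of identical lines."""
    kept = []
    previous = None
    run = 0
    for line in lines:
        key = normalize(line)
        if not key or key in phrases:
            continue  # empty or on the blocklist
        run = run + 1 if key == previous else 1
        previous = key
        if run <= max_repeats:
            kept.append(line)
    return kept
```

The repeat cap covers the other common failure mentioned in this thread, where noises get turned into the same short line over and over. Exact matching is deliberately simple here; a real filter would probably want fuzzy matching as well.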
 
Haha, hadn't seen that one before :) I'm guessing you used the large-v3 model? That one produces some weird stuff.
Yes, that is with V3. Am considering going back to V2 because I think that one is slightly better.

Too bad I can't get the ctranslate2 version to work with my GPU. I guess I could check and see how long it actually takes using CPU, but it's probably way too slow.
 
whisper-ctranslate2 uses faster-whisper and adds a few more features: colouring subtitles, live transcription from a microphone, and loading models from a local folder. None of these features are essential for the quality of the subs.

You can just use faster-whisper directly. Here are 2 good options for faster-whisper:

They both run nicely on Windows and are quite fast with NVIDIA GPUs. The first one uses some proprietary techniques that the author has not disclosed. It does a good job.

In terms of models (for Japanese), in my view:
large-v3: do NOT use. Wait for the next update.
large-v2: use this one for general purpose.
large-v1: use this one for cross talks and multiple speakers

Also note that almost all implementations use SileroVAD for preprocessing before feeding the audio chunks to the model. By default, SileroVAD is designed to ignore background speakers / voices. It is also not very effective with total silence. The author is working on a new version (ver 5) which is supposed to improve on the current deficiencies.
 
Wow, thank you for that. WhisperJAV is amazing. You know how there are large and medium models; is there a model that is specific to JAV?
 
To my knowledge there are no models specifically trained for JAV; someone earlier in this forum was suggesting to embark on it. In general, for Japanese you'd need a large model. The medium model does a decent job of giving you a sense of the topic/dialogue, but that's it.
 
I was using WhisperJAV and I got this: "The secret `HF_TOKEN` does not exist in your Colab secrets." It did not move after that. Is there a fix, or do I just have to wait?
 
I hate to be the 'me too' guy but...
I did try creating a token in HF (READ) and adding it to my secrets without any luck.

On another matter, why not add a:
runtime.unassign()
at the end of the code to make sure Colab terminates immediately on finishing? This seems a little more efficient than waiting for a session timeout, and it avoids losing credit. We'd probably need to lose the 'download as zip' human interaction at the end, but that isn't a big deal - the files are still there in Google Drive, right?
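Something like the sketch below is what I have in mind. `runtime.unassign()` comes from the `google.colab` package; the wrapper function `release_colab_runtime` is just a name I made up so the same cell doesn't crash when run outside Colab. I am assuming everything has already been written to Google Drive before this is called, since the VM and its local files go away.

```python
def release_colab_runtime():
    """Release the Colab VM if running inside Colab; return False elsewhere."""
    try:
        # Only importable inside a Colab session.
        from google.colab import runtime
    except ImportError:
        return False
    # Disconnects and frees the runtime, so nothing after this line is reached
    # in practice; make sure outputs are saved to Drive first.
    runtime.unassign()
    return True
```

It would go in the very last cell, after the subtitles have been copied to Drive.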
 
To my knowledge there are no models specifically trained for JAV; someone earlier in this forum was suggesting to embark on it. In general, for Japanese you'd need a large model. The medium model does a decent job of giving you a sense of the topic/dialogue, but that's it.
Like all things AI - it's a good starting point :rolleyes: