Whisper (OpenAI) - Automatic English Subtitles for Any Film in Any Language - An Intro & Guide to Subtitling JAV

To my knowledge there are no models specifically trained for JAV; someone earlier in this forum suggested embarking on that. In general, for Japanese you'd need a large model. The medium model does a decent job of giving you a sense of the topic and dialogue, but that's it.

@mei2

Is this a new thing? Never received that warning before:

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:72: UserWarning: The secret `HF_TOKEN` does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks.

config.json reaches 100% but stops there, with nothing after. I never had to use an HF token with your Colab before.

Version used is v0_6i


-Besh
 

Yes, I just received an issue report on GitHub too. It seems that something got broken in the Colab environment. Interestingly, Google hasn't announced any official upgrades to Colab since 18th December, but it seems they have changed the security policy and made some other updates.

I haven't been able to find a workaround for it yet. Will still dig into it.

cc: @dickie389389 , @bobe123 , @MrKid
 
cc: @dickie389389 , @bobe123 , @MrKid , @Besh

Here is a quick workaround: version 1.0 (beta)



The root of the error seems to be a code break caused by the recent Colab upgrade: mayankmalik-colab .
In this beta release I have replaced faster-whisper with the stable-ts package. For now, this release does not use the quantised model for faster speed. We have to wait until the root cause is solved to go back to that.

I have tested this release only on a handful of test audio files. If you come across any issues please let me know.
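
For anyone curious what a stable-ts run looks like outside the notebook, here is a minimal sketch. It is not the notebook's actual code: the helper names `srt_path_for` and `transcribe_to_srt`, the `large-v2` default, and the Japanese-to-English settings are assumptions for illustration. It relies only on the stable-ts calls `load_model`, `transcribe` and `to_srt_vtt`.

```python
import os


def srt_path_for(media_path: str) -> str:
    """Return the subtitle path next to the media file (hypothetical helper)."""
    base, _ext = os.path.splitext(media_path)
    return base + ".srt"


def transcribe_to_srt(media_path: str, model_name: str = "large-v2") -> str:
    """Translate Japanese speech into English subtitles with stable-ts."""
    # Imported lazily so the path helper above works without stable-ts installed.
    import stable_whisper

    model = stable_whisper.load_model(model_name)
    # task="translate" asks Whisper for English output from Japanese audio.
    result = model.transcribe(media_path, language="ja", task="translate")
    out_path = srt_path_for(media_path)
    # word_level=False writes ordinary segment-level subtitles.
    result.to_srt_vtt(out_path, word_level=False)
    return out_path
```

Calling `transcribe_to_srt("movie.mp4")` would write `movie.srt` next to the input. This runs the full-precision model, so expect it to be slower than the quantised faster-whisper path.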
 
Expanding on my previous post, for those who prefer faster-whisper (2x speed):

There is a workaround to run 0.6i (faster-whisper). The workaround is not the smoothest but it does the job. Follow the guideline here:


Steps:
1. Run the cells (or all cells) once.
2. Select 'tools' (in the menu bar above) --> 'Command palette'.
3. Select 'use fallback runtime version'.
4. Run the cells (or all cells) again.

Note: the fallback environment is supposed to be a temporary solution. The Google rep in the Colab community says that they're doing their best to keep the environment up longer than usual, but eventually they will need to shut it down. I hope by then we will have a robust solution.
 
Any chance you will add the V3 model?

My assessment so far:
large-v3: do NOT use. Wait for the next update.
large-v2: use this one for general purpose.
large-v1: use this one for cross talk and multiple speakers.

I'd wait until the Whisper main branch is updated with the necessary fixes. A few fixes have been suggested to reduce V3's wild hallucination, but they are still not ready for prime time. Even the most robust fixes suggested so far only bring V3's results up to V2 quality. So no gain.

It's a pity that V3 has been such a big disappointment. OpenAI estimated that V3 would improve Japanese WER by 18%, but I guess they failed to measure that hallucination got worse by 200% :)
 
ok thanks much appreciated.

Another question: what does "chunk_duration:" do? The default is 4; what does increasing or decreasing it do?

Also, "I'm going to sleep." is a phrase that shows up a lot in error.
 
Does anybody else experience timing issues fairly often when using WhisperWithVAD (up to ~30 sec sometimes)?
 
Are you using the large-v2 model, and not large-v3 (or, of course, "large")?
Yes, large-v2. I was also using large-v1 and large-v3, but all of them have the same flaw. There seems to be no apparent reason. Sometimes it is perfectly fine, but other times it is completely out of sync. Even retrying does not solve the issue.
 

V3 requires very different settings than V2 does, so I think VAD should be updated or tweaked to account for the new settings. Something like the calibration being different (0.5 threshold of one model corresponding to 0.55 on the other) would lead to large differences in hallucinations.

I think it is useful to do a basic run that mostly uses the whisper-ctranslate2 / faster-whisper defaults, and then do a second run with a lower threshold, a reduced length penalty, and a higher repetition penalty.
 
--repetition_penalty, at least in the ctranslate2 faster-whisper, deals well with large-v3 hallucinations. Setting it high, to something like 1.5, deals with most of the regular hallucinations where phantom lines are repeated over and over, but there will still be some hallucinations. I also turn down the length_penalty to 0.7, but that is less crucial.

I'm having good luck with setting the vad_threshold low (like 0.2 or 0.3) while using a high --patience or --best_of.

High patience (like 15 or 20) seems to help with timings as well. Some of the Slave Color entries that I did not spend a lot of time editing before get way better first drafts with large-v3, with a smaller file size and fewer lines. It misses less and has a way better basic interpretation of what is being said in a lot of the series' verbal torments.
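
The two-run idea and the numbers above can be sketched with the faster-whisper Python API. This is only an illustration under assumptions: the function names, the Japanese-to-English settings and the exact defaults (0.3 for the VAD threshold, 15 for patience) are picked from the ranges mentioned in the posts, not a tested recipe. The keyword names are the ones `WhisperModel.transcribe` accepts.

```python
# Options shared by both runs (assumed: Japanese audio, English output).
BASE_OPTIONS = {"language": "ja", "task": "translate", "vad_filter": True}


def first_pass_options() -> dict:
    """Basic run: leave decoding and VAD settings at the library defaults."""
    return dict(BASE_OPTIONS)


def second_pass_options(
    vad_threshold: float = 0.3,
    repetition_penalty: float = 1.5,
    length_penalty: float = 0.7,
    patience: float = 15.0,
) -> dict:
    """Second run: lower VAD threshold, stronger anti-repetition decoding."""
    options = dict(BASE_OPTIONS)
    options["vad_parameters"] = {"threshold": vad_threshold}
    options["repetition_penalty"] = repetition_penalty
    options["length_penalty"] = length_penalty
    options["patience"] = patience
    return options


def run_pass(audio_path: str, options: dict, model_name: str = "large-v3"):
    """Transcribe one file and return (start, end, text) tuples."""
    # Imported lazily so the option helpers work without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, **options)
    # segments is a generator; iterating it is what runs the transcription.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Running `run_pass(path, first_pass_options())` and then `run_pass(path, second_pass_options())` gives two drafts to compare line by line; how to merge them is left to the editor.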
 
Has anyone compared the quality of subs produced by Google Live Captions (beta) with Whisper?
 
Guys, I'm having this problem. Does anyone know the reason?


RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

I'm also getting another error while trying to connect to a GPU. Is that related to membership? I thought it was free.
Unable to connect to GPU backend
You cannot connect to the GPU at this time due to usage limits in Colab. Get information
For more GPU access, you can purchase a Colab processing unit with Pay As You Go.
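
A CUDA driver error like the one above can appear when the session has no GPU attached, which would fit the usage-limit message. A quick way to check before loading a model is sketched below. The helper names `gpu_visible` and `pick_device` are hypothetical, and the CPU fallback with `int8` is an assumption about what you would want for faster-whisper, where it will be much slower than a GPU run.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: ...".
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def pick_device() -> tuple:
    """Choose (device, compute_type) for faster-whisper's WhisperModel."""
    if gpu_visible():
        return ("cuda", "float16")
    # No GPU in this session: fall back to CPU with int8 quantisation.
    return ("cpu", "int8")
```

If `pick_device()` returns `("cpu", "int8")` in Colab, the runtime is not a GPU runtime, and the model would need to be created with those values instead of `cuda`.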
 
I find that using Subtitle Edit you can do this much better with a GUI. No need for any fancy installation or python, etc. It has audio-to-text option where you can use different models and whisper engines.
 
I like Subtitle Edit, but the quality of the subs is a problem for me. Which model are you using?