Whisper and its many forms

It seems to be an incompatibility between the newer pytorch version and the VAD code. Not sure how to fix it yet.

The problematic function seems to be torch.hub.load (https://pytorch.org/docs/stable/hub.html#torch.hub.load). I don't see an "onnx" parameter in that doc (it's probably a custom parameter for the VAD model), but removing it doesn't help.
I don't see any other obvious problems with it, so fixing it would require knowing more about the code.

You can always try to open an issue on the github page of the guy who made it: https://github.com/ANonEntity/WhisperWithVAD/issues

It might also be that whisper doesn't support "triton-2.1.0" yet, in which case waiting for a new whisper version would fix it all, since it does seem to force-install an older triton even if we manually install the new one first. Hard to say exactly, and I can't mess with it too much atm.
I think this is the code he's using: https://colab.research.google.com/github/ANonEntity/WhisperWithVAD/blob/main/WhisperWithVAD.ipynb
 
The code is available on the colab page directly if you simply press "show code", and on the github I linked if you exit the issues page. I only had a quick look since I can't really stay on my computer much these days, but it seems to be a whisper requirement issue rather than a problem with that code.

It seemed to be refusing to install triton-2.1.0, which the new pytorch version requires. That's where I had to stop messing around with manually installing packages by modifying the code before running it, and I haven't messed around with it more since.
 
Yeah, I had the same observation. Whisper seems really strict about the package range it allows, and manually forcing other versions only results in more/other issues.
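For anyone who wants to see what actually got installed before and after forcing a version, here is a minimal sketch that reports the installed versions using only the standard library. The package names in PACKAGES are my assumption about what the notebook pulls in (openai-whisper is the pip name of the reference whisper package); adjust them to whatever your install step uses.

```python
from importlib import metadata

# Assumed package names; change these to match the notebook's install step.
PACKAGES = ("torch", "triton", "openai-whisper")


def installed_versions(names=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once right after the install cell and once after a manual install makes it easy to spot when a package has been quietly swapped back to an older version.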
 
Mei2 downgraded pytorch and it works. You can use his fixed version while the original is broken:

@SamKook, guys, I went ahead and fixed the Whisper Silero VAD for now. This should work. However, because I didn't have editor rights to the original one, I had to make a copy and publish it from my github. I hope the original owner (ANonEntity) does not mind. I'm sure he doesn't.


WhisperWithVAD - maintenance release
It is saved here: https://github.com/meizhong986/WhisperJAV/tree/main/notebook

I have called the new version a maintenance release so it doesn't get mistaken for the original.

PS. If there are any issues, let me know. I don't have much bandwidth but will try my best to fix it if this gets broken again.
PPS. The installation step takes a bit longer in this version because of the cleanup. Stay patient, it takes about 3 minutes in total.
 
Since whisper has spawned many tweaked versions since its introduction, and since so many people keep asking questions on how to use it, I figured I'd make a thread dedicated to it.

If you have questions about how to use it or you want me to add something I forgot/don't know about, just let me (or anyone else following) know in this thread.

Just click on the spoiler tag to open or close a section.

FAQ

Many thanks for these great instructions. I finally got it working and am impressed with Whisper's functionality and accuracy in transcribing the spoken language. Still, the actual translation by DeepL is lacking in my opinion. I guess they apply some filters to eliminate slang, specifically JAV slang.

Question: What would I have to write in the field "deepl_target_lang" to get the transcription in Japanese only?

Just "JP" or "JP-JP"?

Would this work at all? Thanks again for your valuable input on this!
 
It doesn't use deepl to translate by default; it uses its own internal translation (as far as I can tell). The issue is that the models are most likely trained heavily on youtube, which has basically no adult content, so that's why it's not great with more adult themes.

To get just a transcription, you actually want to change "translation_mode" to "No translation".

As for that deepl setting (which you don't need to change), it would simply be "JA", but to use deepl you need a deepl authentication key, which is not free.

Here are the possible language settings for it (for those wondering):
Code:
BG - Bulgarian
CS - Czech
DA - Danish
DE - German
EL - Greek
EN - English (unspecified variant for backward compatibility; please select EN-GB or EN-US instead)
EN-GB - English (British)
EN-US - English (American)
ES - Spanish
ET - Estonian
FI - Finnish
FR - French
HU - Hungarian
ID - Indonesian
IT - Italian
JA - Japanese
KO - Korean
LT - Lithuanian
LV - Latvian
NB - Norwegian (Bokmål)
NL - Dutch
PL - Polish
PT - Portuguese (unspecified variant for backward compatibility; please select PT-BR or PT-PT instead)
PT-BR - Portuguese (Brazilian)
PT-PT - Portuguese (all Portuguese varieties excluding Brazilian Portuguese)
RO - Romanian
RU - Russian
SK - Slovak
SL - Slovenian
SV - Swedish
TR - Turkish
UK - Ukrainian
ZH - Chinese (simplified)
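If you want to check a value before pasting it into "deepl_target_lang", here is a small sketch that tests a code against the list above. The function name is made up for illustration and the set is just copied from that list, so it only reflects what the list contained when it was posted; it does not talk to deepl at all.

```python
# Target-language codes copied from the list above (33 entries).
DEEPL_TARGET_LANGS = {
    "BG", "CS", "DA", "DE", "EL", "EN", "EN-GB", "EN-US", "ES", "ET", "FI",
    "FR", "HU", "ID", "IT", "JA", "KO", "LT", "LV", "NB", "NL", "PL", "PT",
    "PT-BR", "PT-PT", "RO", "RU", "SK", "SL", "SV", "TR", "UK", "ZH",
}


def normalize_target_lang(code):
    """Return the upper-cased code if it appears in the list above, else None."""
    candidate = code.strip().upper()
    return candidate if candidate in DEEPL_TARGET_LANGS else None


if __name__ == "__main__":
    for value in ("ja", "JP", "JP-JP", "en-us"):
        print(value, "->", normalize_target_lang(value))
```

So "JA" is accepted, while "JP" and "JP-JP" are not in the list.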
 
@SamKook, I wonder if you've had a chance to play with the large-v3 model. To me it produces more hallucination and repetition. Many of the hallucination lines are new, I guess from new youtube sources used to train large-v3.

On that topic: I added options large-v1, large-v2, and large-v3 to the maintenance release so users can choose. Funnily enough, for videos with crosstalk and bad audio, large-v1 seems to perform better at picking up the crosstalk.

Off topic: I fixed a bug in the hallucination removal of whisperjav. Version 06g has the strongest hallucination and repetition removal so far.
 
Haven't tried it at all yet.

I've mostly only used whisper to mess around. It's not reliable enough for my taste since you can't get the same result twice (not with default settings at least), and fixing a sub to my liking would require more work than I have time to allocate to it.

After hearing about the new ChatGPT version being worse in many ways, it doesn't surprise me that new does not equal better for those models either, but since I always get a different result, I've never been able to reliably compare even different settings on the same model.
 
Is it only me, or does this downgraded version have worse translation?

It should not have any impact on the translation. Can you check which model_size you're using? If it's "large", switch it to "large-v2". There are many comments that large-v3 is worse than large-v2. That might be the cause if you see any degradation.

PS. The large-v2 and large-v3 options were added in the latest maintenance release.
 
It's technically not really downgraded; it's simply not using the new incompatible version, just keeping about the same version that always worked. So other than the new large model whisper added, it should be the same as before.
 
I see. I asked because I tried to translate SNIS-668 and I feel a lot of words are getting mistranslated.
 
Is there a size limit to the files that it can transcribe if I upload the audio to Google Drive and mount it to run on the interface? I've been using the original colab. My transcribing method is a little tenacious, so I always end up uploading large files to the system, but it produces good results. The main issue is that I frequently run out of resources before the upload is complete, forcing me to settle for small, poor-quality audio.
 
Uploading your files to Google Drive and mounting it is the "robust" way of doing things. I suggest not reducing the quality of your audio but choosing the right audio conversion/format methods instead. I'd suggest following the audio preparation steps by @SamKook here: https://www.akiba-online.com/threads/whisper-and-its-many-forms.2142559/

If you keep running out of resources, take a look at the whisperJAV option. Its quality is not as good as whispervad's, but resource usage is half and speed is double.
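As a rough illustration of "the right conversion" (this is not the linked guide's exact procedure, just a sketch under my own assumptions): as far as I know, whisper's own audio loader decodes everything to 16 kHz mono anyway, so extracting a mono 16 kHz lossless track should shrink the upload without throwing away anything the model would use. The VAD notebook may handle audio differently, so treat the settings as a starting point. The function name is hypothetical and ffmpeg needs to be installed to actually run the printed command.

```python
import shlex
from pathlib import Path


def build_extract_command(video_path, sample_rate=16000):
    """Build an ffmpeg command that extracts a mono FLAC track from a video."""
    source = Path(video_path)
    target = source.with_suffix(".flac")  # same name, .flac extension
    return [
        "ffmpeg",
        "-i", str(source),
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the given rate in Hz
        "-c:a", "flac",           # lossless audio codec
        str(target),
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed or pasted into a terminal.
    print(shlex.join(build_extract_command("movie.mp4")))
```

The command is returned as a list so it can be passed straight to subprocess.run without worrying about spaces in file names.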
 
Thank you, I'll check it out. I noticed that Whisper is really good with audio from videos that contain few characters. It can capture the raw transcript quite well, but it still struggles to identify or separate speech from background voices when multiple speakers talk at once. I don't condemn the technology, since it is still in early development and there is room for improvement, but I hope it will catch up soon. As for the model, is it possible for us to train our own model? I've been thinking about it lately.
 
I've been thinking about the same thing. Although it is possible to fine-tune the models (train them with additional data sets), creating those data sets seems to be the crucial part. We need "precise" transcriptions or translations of JAV that can be used as a training data set.
 
Any ideas on how one can do it? We could train a specialized model just for JAV, and it would be a great cause for the fandom. I've tested every model available at the moment. It seems that a mix of v3 and v2 would be a significant improvement, as I have produced good-quality raw transcriptions by combining both of their results.
 
I use the colab and it works great. But sometimes it doesn't transcribe all the audio: someone is speaking but there's no translated text.

Does anyone have suggestions on how to fix that? Does avidemux not make the volume loud enough for the translation program to hear the audio? Or is it something else? Thanks.