Whisper and its many forms

SamKook

Grand Wizard
Staff member
Super Moderator
Uploader
May 10, 2009
3,594
4,963
The AI is just guesswork, so sometimes it thinks noise is people speaking and sometimes it thinks people speaking is noise. That can happen even if the audio is loud enough.
AI as we have it (which is machine learning, not actual intelligence, since it relies 100% on its training being accurate) will never be perfect, so manual editing/fiddling will always be required.

If you don't tell avidemux to increase the volume, it just copies the audio as-is, which is the point of the demuxing part of my tutorial.
A problem with increasing the volume across the whole file is that it might make some parts too loud to be recognized as speech, so that should really only be used for specific portions.

To have more speech recognized, what you want is to play with the whisper settings themselves. There are no perfect settings to always use, or they would be the defaults, so it'll depend on what you want. More speech recognized also means more noise recognized as speech.

large-v3 will recognize more speech than large-v2 from what I saw, but the quality of the translation doesn't seem to be as good, so an easy way would be to use v3 to fill in what v2 misses.
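One way to picture the "use v3 to fill in what v2 misses" idea is as a merge of two subtitle lists. The sketch below is only an illustration under assumptions: the `(start, end, text)` tuple layout and the helper names `overlaps` and `fill_gaps` are made up for this example, and how you load the lines from your .srt files is left out.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def overlaps(a: Segment, b: Segment) -> bool:
    """True when two segments share any stretch of time."""
    return a[0] < b[1] and b[0] < a[1]


def fill_gaps(primary: List[Segment], secondary: List[Segment]) -> List[Segment]:
    """Keep every primary line; add secondary lines only where primary has nothing."""
    extra = [s for s in secondary if not any(overlaps(s, p) for p in primary)]
    return sorted(primary + extra, key=lambda seg: seg[0])


# Hypothetical example data: v2 is trusted for wording, v3 only fills the holes
v2 = [(0.0, 2.0, "Hello."), (10.0, 12.0, "See you later.")]
v3 = [(0.1, 2.1, "Hi."), (5.0, 6.5, "Are you there?"), (10.0, 12.0, "Bye.")]
merged = fill_gaps(v2, v3)
```

Here only the 5.0 to 6.5 second line from v3 is added, because the other two v3 lines overlap lines that v2 already produced. The added lines still deserve a manual check, since extra speech recognized also means extra noise recognized as speech.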

There is a thread that discusses tuning the settings that you may want to look at here: https://www.akiba-online.com/thread...age-an-intro-guide-to-subtitling-jav.2115103/
 

composite

Active Member
Jul 25, 2015
222
147
Many thanks for the reply!
 

mei2

Well-Known Member
Dec 6, 2018
229
378
I use the colab and it works great. But sometimes it doesn't transcribe all the audio. Someone is speaking but there's no translated text.

Does anyone have suggestions how to fix that? Does avidemux not make the volume loud enough for the translation program to hear the audio? Or something else? thanks.

I second the earlier post by @SamKook. There are some techniques that can help you capture dialog that is missing or hard to pick up, but they all come with a trade-off. Here are some suggestions:

  1. Make sure your audio is good and that there are no dropped frames or bad headers in the parts you need to transcribe.
  2. Decrease the VAD threshold. A value of 0.2 is usually good for picking up detail; I sometimes dial it down to 0.18.
  3. Increase the values of beam_size and best_of.
  4. Increase the temperature value.
  5. Increase the value of patience.
  6. Normalise the audio volume. This one is tricky: you don't want to distort the spectrogram. I suggest keeping it under -1 dB.
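For the normalisation item, here is a rough sketch of peak normalisation. It assumes that "under -1 dB" means the loudest sample should sit at -1 dBFS, and that the audio is already loaded as floats between -1.0 and 1.0. The function name `peak_normalize` is hypothetical. A single gain applied to the whole file changes the level but not the shape of the spectrogram; it is clipping that causes distortion, which is why the target stays below full scale.

```python
from typing import List


def peak_normalize(samples: List[float], target_dbfs: float = -1.0) -> List[float]:
    """Scale samples so the loudest one sits at target_dbfs (0 dBFS = amplitude 1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target_amplitude = 10 ** (target_dbfs / 20)  # -1 dBFS is roughly 0.891
    gain = target_amplitude / peak
    return [s * gain for s in samples]


# Hypothetical quiet clip whose loudest sample is at half of full scale
quiet = [0.1, -0.5, 0.25]
louder = peak_normalize(quiet)
```

Because the gain is the same for every sample, quiet passages stay quiet relative to loud ones. Raising only specific portions would need a different approach.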

The trade-off with many of these is more hallucination and slower processing.
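As a rough sketch of where the decoding settings above go, this is how they could be passed to the openai-whisper Python package. It assumes the notebook uses that package's `model.transcribe()`, which may not match your setup. The helper names, the `"ja"` language code and the default values are assumptions for illustration, not recommended settings.

```python
from typing import Any, Dict, Tuple


def temperature_schedule(start: float = 0.0, increment: float = 0.2) -> Tuple[float, ...]:
    """Temperatures to try in order, from start up to 1.0, when decoding falls back."""
    if increment <= 0 or start > 1.0:
        return (start,)
    temps = []
    t = start
    while t <= 1.0 + 1e-6:
        temps.append(round(t, 2))
        t += increment
    return tuple(temps)


def build_decode_options(beam_size: int = 5, best_of: int = 5,
                         patience: float = 1.0,
                         start_temperature: float = 0.0) -> Dict[str, Any]:
    """Collect keyword arguments for model.transcribe() in openai-whisper."""
    return {
        "task": "translate",
        "language": "ja",        # assumption: Japanese source audio
        "beam_size": beam_size,  # beam search width, used at temperature 0
        "best_of": best_of,      # number of samples, used at temperature above 0
        "patience": patience,    # beam search patience, 1.0 is plain beam search
        "temperature": temperature_schedule(start_temperature),
    }


def transcribe(audio_path: str, model_size: str = "large-v2", **overrides: Any) -> dict:
    """Run Whisper with the options above (needs the openai-whisper package)."""
    import whisper  # imported here so the helpers work without the package installed

    model = whisper.load_model(model_size)
    return model.transcribe(audio_path, **build_decode_options(**overrides))
```

For example, `transcribe("clip.wav", beam_size=8, patience=2.0)` would search more candidates per line, at the cost of speed.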

Reading your post, I suspect the missing dialog lines are a side effect of the VAD, especially if you're using Silero VAD. In my case I have started to drop VAD entirely. The developer of Silero VAD is working on a new version that is supposed to address the dropped-dialog issue. I have been waiting for it for some time now and hope he can finish his work soon.
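To see why a VAD can drop dialog and why lowering the threshold helps, here is a deliberately simplified illustration. It is not Silero VAD's actual algorithm (which also applies minimum durations and padding). It only thresholds a list of made-up per-frame speech probabilities. The function name `speech_regions` and the numbers are hypothetical.

```python
from typing import List, Tuple


def speech_regions(probs: List[float], threshold: float,
                   frame_seconds: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs of frames with probability >= threshold."""
    regions = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins
        elif p < threshold and start is not None:
            regions.append((round(start * frame_seconds, 3), round(i * frame_seconds, 3)))
            start = None  # the run ended on the previous frame
    if start is not None:  # audio ended while still inside a run
        regions.append((round(start * frame_seconds, 3),
                        round(len(probs) * frame_seconds, 3)))
    return regions


# Made-up probabilities: a quiet line (frames 1-2) and a clear line (frames 4-5)
probs = [0.1, 0.3, 0.25, 0.1, 0.7, 0.8, 0.1]
strict = speech_regions(probs, threshold=0.5, frame_seconds=1.0)
relaxed = speech_regions(probs, threshold=0.2, frame_seconds=1.0)
```

With the strict threshold only the clear line survives, so Whisper never sees the quiet one. The relaxed threshold keeps both, but it would also let through any noise that scores above 0.2.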

Also, a quick workaround for missing dialog is to use SubtitleEdit: just select the parts that are missing and transcribe those using, say, whispercpp.
 

composite

Looking at the colab, there are only 4 "required" settings:

audio_path, model_size, language and translation_mode.

Then 7 "advanced" settings:

deepl_authkey, source_separation, vad_threshold, chunk_threshold, deepl_target_lang, max_attempts and initial_prompt

How do I change the values for beam_size and best_of? And then temperature value and patience?
 

mei2

I just made a fork of the WhisperWithVAD that exposes those parameters. Here:

WhisperWithVAD-PRO


I needed to call it something other than the original name, so I went with PRO to indicate that there are more advanced settings for users :)
I have tested it on only 2 test audio files. Let me know if you run into any issues.
If I make any new updates I will keep them in this repository:

 

ArtemisINFJ

God Slayer, Dawnbreaker
Nov 5, 2022
68
84
I've wanted to say thank you for your contributions, especially WhisperJAV. Ngl, it really helps a lot with creating early subs for my fav JAV. What's your plan for the next update or future iteration?
 

mei2

Yes, speed was the main purpose of WhisperJAV: to make subbing possible for same-day releases. Speed was primary and quality secondary.
I would like the next iteration/project to be about quality. I think there is a lot of room to improve the quality of subs by preparing the audio better. At any rate, I am open to suggestions, ideas and feature requests.

PS. I looked into training/fine-tuning Whisper for JAV, but my early research shows that the loss would be higher than the gain.
 

ArtemisINFJ

I don't have any good ideas at the moment, but I strongly support improving quality in the next iteration, since I find that the existing models we commonly use aren't living up to expectations. The model usually gets confused on videos with multiple complex scenes, which really takes away from the subtitling experience.

PS. I have tried your WhisperWithVAD-PRO version. In my early tests it comes out better than the existing WhisperWithVAD, but the transcription process seems to take a bit longer than usual.

Could you share your early research results on fine-tuning a model for JAV? I'm not highly skilled in this area, but maybe I can help with other things, such as variables, etc.
 

mei2

Yes, the PRO version is expected to be slower. There are 3 main reasons for that:
  • it uses word timestamps for more accurate timing,
  • it uses the new option provided by whisper to reduce hallucination and repetition (hallucination_silence_threshold),
  • it uses a higher value for patience to get more accurate word predictions (patience=2).

The whisper option hallucination_silence_threshold is still under development/refinement and has a tendency toward false positives. If you see that some obvious lines are missing, you can reduce that value or remove it. As always, every option in whisper comes with a trade-off :)
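A small sketch of how those three settings could be collected, with the threshold made easy to lower or remove. The keyword names match arguments of `model.transcribe()` in the openai-whisper package as far as I know, but whether the PRO notebook passes them in exactly this way is an assumption. The helper name `pro_style_options` and the default of 2.0 seconds are hypothetical.

```python
from typing import Any, Dict, Optional


def pro_style_options(hallucination_silence_threshold: Optional[float] = 2.0,
                      patience: float = 2.0, beam_size: int = 5) -> Dict[str, Any]:
    """Build transcribe() keyword arguments; pass None to leave the threshold out."""
    options: Dict[str, Any] = {
        "word_timestamps": True,  # the threshold option relies on word timestamps
        "beam_size": beam_size,   # patience only applies to beam search
        "patience": patience,
    }
    if hallucination_silence_threshold is not None:
        options["hallucination_silence_threshold"] = hallucination_silence_threshold
    return options


# If obvious lines go missing, try a lower value first, then remove the option
with_lower_threshold = pro_style_options(hallucination_silence_threshold=1.0)
without_threshold = pro_style_options(hallucination_silence_threshold=None)
```

The resulting dictionary would be passed as `model.transcribe(audio_path, **options)`.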
 

Dom047

New Member
May 5, 2016
9
4
test.JPG

I keep getting an ffmpeg error on both the PRO and regular VAD versions.
The PRO version also gave me an error before this saying that "source separation" wasn't defined, even though it's checked.

Anyway, I hope someone can help me solve this issue.
 

SamKook

My first guess would be that you didn't fill out the audio path properly. What did you put in that field?
 

Dom047

I usually do a file upload, but this time I tried linking my Google Drive and providing the path that way. I'll try to run it again later and verify whether it's a path issue; I thought an ffmpeg error would have meant something else. I don't know much about this stuff, but I appreciate the input.
 

SamKook

If you look closely at the line that raises the error, you'll see it's where ffmpeg loads the audio as an input (audio_path is the variable that holds the path you provide) to split it into chunks for the VAD system (or pre-process it; the code is cut off there and I haven't looked at what it does exactly).

It could be an issue with ffmpeg itself, but then everyone would have the same problem. The other part of the equation is the only thing that requires user input, so a user input issue is much more likely. It could be something else, but without more in-depth information, it's always safer to go with the most likely option.
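A quick way to rule out the user-input side before blaming ffmpeg is to check the path yourself in a notebook cell. This is a minimal sketch; the function name `check_audio_input` is hypothetical, and it only covers the basics (empty field, missing file, empty file, ffmpeg not installed).

```python
import shutil
from pathlib import Path
from typing import List


def check_audio_input(audio_path: str) -> List[str]:
    """Return a list of problems found; an empty list means the basics look fine."""
    problems = []
    if not audio_path.strip():
        problems.append("audio_path is empty")
    else:
        path = Path(audio_path)
        if not path.exists():
            problems.append(f"file not found: {audio_path}")
        elif not path.is_file():
            problems.append(f"not a file: {audio_path}")
        elif path.stat().st_size == 0:
            problems.append(f"file is empty: {audio_path}")
    if shutil.which("ffmpeg") is None:  # ffmpeg must be findable on PATH
        problems.append("ffmpeg is not on PATH")
    return problems


# Print whatever looks wrong for a path that does not exist
for problem in check_audio_input("/no/such/folder/clip.wav"):
    print(problem)
```

A "file not found" result for a Google Drive path usually means Drive is not mounted in the current session or the path is mistyped, which would match the error appearing right where ffmpeg opens its input.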
 

granca

Member
Mar 4, 2017
62
78
Shortening the audio file dramatically reduces hallucination in my experience, especially since the large model tends to hallucinate quite a lot on long files (compared to the base model). Scripts based on the old Google model used to do that quite effectively. The code I was using also automated a lot of the prep work on the audio file, if I remember correctly. I might be able to dig up the code from GitHub if you are interested.
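This is not the script mentioned above, just a sketch of the splitting idea: work out time ranges for shorter pieces, with a small overlap so a line that straddles a cut is not lost. The function name `chunk_ranges`, the 10-minute chunk length and the 2-second overlap are all arbitrary assumptions.

```python
from typing import List, Tuple


def chunk_ranges(total_seconds: float, chunk_seconds: float = 600.0,
                 overlap_seconds: float = 2.0) -> List[Tuple[float, float]]:
    """Split [0, total_seconds] into overlapping (start, end) ranges."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    ranges = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        ranges.append((start, end))
        if end >= total_seconds:
            break  # reached the end of the audio
        start = end - overlap_seconds  # step back a little so cuts overlap
    return ranges


# A 25-second clip cut into 10-second pieces with 2 seconds of overlap
pieces = chunk_ranges(25, chunk_seconds=10, overlap_seconds=2)
```

Each piece would then be cut out and transcribed on its own, and the start time of the piece added back to every subtitle timestamp it produces. Lines that fall inside an overlap can show up twice and need de-duplicating.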
 

DScott

Well-Known Member
Jan 27, 2024
328
399
This question is directed at anyone who may know the answer, as opposed to bothering SamKook yet again. I've been toying with Whisper AI (locally) and have had reasonable results. I typically use the 'model medium' setting because I did not find any real value, accuracy-wise, in using the 'large' setting. I understand that you can do multiple conversions in one command by listing the files (I use quotation marks around each filename to account for spaces), so the format would be:

whisper "abc-123.mp4" "abc-124.mp4" "abc-599.mp4" --model medium --task translate

The most I've done this with is about 15 files, but there's no reason to believe that I could not do more. The problem (first world problems, huh?) is that you have to manually insert each file name and then the extension. I have a template that I use:

whisper ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" ".mp4" --model medium --task translate

and I can add or remove individual spaces as needed. So, this is what I have been doing. What I am hoping is that someone wiser than I can tell me whether there's a way to configure whisper to convert an entire folder with a single command, rather than entering each file name and extension individually. Thanks in advance for any help. Incidentally, this is a very small concern, so no biggie if nobody knows.

One question though, and I guess this is directed at Sam: I'm about 2/3 complete with the ASW files and I have about 100 to go. I've done subs for several of these. Would these be a desired addition to my uploads, or should I not bother? Cheers everybody
 

SamKook

You can do it with a batch script (I'm assuming you're on Windows). Just copy and paste this code into a text file, change the whisper exe location (the path after the = in the line set "whisper=C:\Progs\whisper\whisper.exe") to where it is on your PC, and save it with the .bat extension. So instead of whisper.txt (which you'd get by default if you save a simple text file with Notepad), you name it whisper.bat, or whatever name you want with a .bat extension. Just make sure there's no hidden .txt extension after it.
Code:
@echo off
setlocal enabledelayedexpansion

rem Define the video file extensions
set "video_extensions=.mp4 .avi .mkv .ts"
rem Whisper location
set "whisper=C:\Progs\whisper\whisper.exe"

rem Loop through each file in the current directory
for %%F in (*) do (
    rem Check if the file has a video extension
    for %%E in (%video_extensions%) do (
        rem /I makes the comparison case-insensitive, so .MP4 matches .mp4
        if /I "%%~xF" == "%%E" (
            echo Processing: %%F
            "%whisper%" "%%F" --model medium --task translate
        )
    )
)

echo Job Done.
pause

It'll take every file located in the same folder as the bat file that ends with one of the extensions specified at the beginning of the script (add more if needed; it's this line: set "video_extensions=.mp4 .avi .mkv .ts") and run them one by one through whisper. Change the whisper command as needed; I just used the same one you did in your post:
"%whisper%" "%%F" --model medium --task translate

"%whisper%" will be whatever you put as the path to whisper, "%%F" will be the video filename and the rest is as-is.


As for whether you should add them to your posts: with how many people request subs on the forum, I'm sure people would like it, and if they don't, it's easy enough to ignore. It's up to you though. You can include them in your posts, or just make one big pack and post that on its own if you decide to post them.
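For anyone not on Windows, the same folder loop as the batch script could be sketched in Python. This is only an illustration: the function names are made up, it assumes a `whisper` command is available on PATH, and the model and task simply mirror the command used in the batch file.

```python
import subprocess
from pathlib import Path
from typing import List

# Same extension list as the batch script; add more if needed
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mkv", ".ts"}


def find_videos(folder: str) -> List[Path]:
    """List files directly in folder whose extension is in VIDEO_EXTENSIONS."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS)


def whisper_command(video: Path, whisper_exe: str = "whisper") -> List[str]:
    """Build the same command the batch file runs, as an argument list."""
    return [whisper_exe, str(video), "--model", "medium", "--task", "translate"]


def run_folder(folder: str = ".", whisper_exe: str = "whisper") -> None:
    """Run whisper on every video in folder, one file at a time."""
    for video in find_videos(folder):
        print(f"Processing: {video.name}")
        # An argument list avoids quoting problems with spaces in file names
        subprocess.run(whisper_command(video, whisper_exe), check=False)
    print("Job Done.")
```

Calling `run_folder()` from the folder that holds the videos does the same job as double-clicking the .bat file.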
 

DScott

Always you, Sam... I should call you Yoda. Thanks, I owe you... nothing new there.

As for the subs, I've generated subs for several ASW files. I chose the files more or less arbitrarily by quickly viewing bits of the video, and if there seemed to be some actual dialog beyond the usual oh-oh-ahh-ahh, I converted it. The result is that my methodology is completely random and subjective and has no organized structure. The net result is that I may have ASW-089, ASW-118 and ASW-184 subbed, but not the rest. I think what I'll do is include the sub with the upload whenever I have made one, including for what I've already uploaded, and then in the future I'll tentatively see if it's manageable to start making subs for all of my subsequent posts. A question on that front though: the subs can just be uploaded as an attachment to the post, right?

Anyway, again, thank you for the instructions on the .bat file. Oh, one more thing: your instructions list .ts as an available file type for whisper to translate. I have been under the impression that whisper was not too fond of .ts, so I have been converting anything that I want to translate to mp4. I am going to try to translate a .ts file in the morning, and if it works I may not be around for a few days because of the concussion I will have given myself slapping my head and going "doh!". Wish me luck...
 

SamKook

You can attach subs directly to the post, but I think you need to zip the sub first, since I don't think srt is in the allowed extensions list. They compress well, so it's a good idea to do it even if you can add srt directly.

I added ts since I knew you had many, but I've never tested any with whisper. I would assume it works since ffmpeg can handle them, but ts is a pretty bad container, so it's possible it could cause issues.