The Akiba-online English Sub Project★NOT A SUB REQUEST THREAD★

It is impressive, but the results are also very random, so in my opinion it's more of a good starting point than a full solution, though the result is good enough for many.

There are a few Colab options that allow people without a top-class GPU to use the bigger models, like the one mentioned in the other subtitle thread on the forum. Here's the last mention of it: https://www.akiba-online.com/thread...not-a-sub-request-thread.1466451/post-4632709
That thread has been talking about whisper stuff for the past few months, so there's lots to read on that over there.

I have no idea how to run those custom models. Since my initial batch of testing to see how well it worked, I've only been running tests to help others, so you just taught me about them.

Finding the best settings is hard, since the randomness of the results means you never know whether the settings gave you a good/bad result or you just happened to get a good/bad one at random.
Thank you, I saw that thread but it seemed like Whisper is really just piggybacking on that thread. Whisper optimization should probably be its own thread but for now I'll post there.
 
Yeah, in theory the other thread is just for posting subs, but it's where most people discuss sub-related stuff. You could start a new thread about whisper settings optimization and link it over there so people can easily find it.
 
Hello my friends and ladies,
Yes because some of us are:p
Anyways, I was subbing and, being hard of hearing, I came across another confusing word. Honestly, during sex scenes the passionate actors' words are sometimes confusing.
So here is what confused me...
1. のかわからない (No ka wakaranai) = I don't know, Don't know, You know, Understand?, (I) Got it, Do you understand?, etc.
However...
2. He was really saying 我慢できない (Gamandekinai) = I can't stand it (any longer), I can't take it (anymore), It's too much (to bear), I've reached my limit, I can't bear it, I can't put up with it, I can't hold it any longer, I can't wait any longer, etc.

Just listen and study the scene:p
 
Is it nan ka .... wakaranai?
nan ka (used when they can't explain what they're feeling) (something like .... I don't know but .......)

I've heard people start with "nan ka" when they were asked something on variety shows
 
It's a filler word, like "you know" in English when it's used to give the speaker time to gather their thoughts.
 
That could be something to look for in the future and thanks @darksider59 for explaining it.

But in this instance, he was clearly saying, 我慢できない Gamandekinai.

Honestly, it did not sound like 'nanka wakaranai'... after I listened with my headphones.
I'm embarrassed to say, I don't even know where 'wakaranai' came from in my first pass of the sub.
I guess my hearing is getting bad.:p

After I increased the volume on my headphones, it was obvious what he was saying: "he couldn't stand it".:)

*It's why I suggested way back that people should wear theirs when subbing.
I don't trust my hearing, and when I view my work later, I get upset at how I mistook an obvious word.
 
I ordered a new gaming computer, which should arrive tomorrow. Since whisper seems to like the same hardware as gaming computers have, I figured I should install that too.

Is there a step-by-step guide somewhere that includes running VAD? Or is that something you can't do locally? The guide mentioned at the start doesn't do VAD as far as I can tell.

Also, which model is better, large or large2? (Or is there an even better one and if so what's it called?)
 
The colab itself is a step by step guide, it literally does the whole installation process every time someone uses it.

The only issue is that it's for Linux and not Windows. I don't know if someone here made a guide to install it on Windows, but it shouldn't be too hard to figure out with a little technical knowledge.

And the default is large2 so that's probably the better one. Someone who uses it more might be able to say for sure though.
 
I've been using whisper with VAD on Google Colab and I've been very happy with the results, but I noticed lately that the timing of the subtitles drifts as the movie goes on until they are completely out of sync. Anyone else having this problem? Any solutions?
 
I have come across constant shifting in two situations:

a) Bad audio source: If the source of the movie has had bad audio encoding, there will be frame drops during the audio extraction, which cause the extracted audio not to be in sync with the original movie source. You can test that by matching the extracted audio with the original movie, or if you use ffmpeg to extract the audio, it will print out all dropped frames and bad headers.

b) Mismatched frame rate: This happens if the extracted audio, by some error, has a different frame rate than the movie. This is rare and I don't think it's the cause in your case. By any chance are you using a .ts file as your movie source?
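As a rough sketch of the ffmpeg check mentioned in a), the Python helper below builds a command that decodes only the first audio track and throws the output away, so that ffmpeg prints just the decoding errors it runs into. The function names and the example file name are made up for illustration, and it assumes ffmpeg is installed and on the PATH.

```python
import subprocess


def build_decode_check_command(source):
    """Build an ffmpeg command that decodes the first audio stream of
    `source` to a null sink and logs errors only."""
    return [
        "ffmpeg",
        "-v", "error",       # print errors only, no banner or progress
        "-i", source,
        "-map", "0:a:0",     # first audio stream of the first input
        "-f", "null", "-",   # decode everything, write nothing
    ]


def list_decode_errors(source):
    """Run the check and return ffmpeg's error lines.

    An empty list means the audio decoded without reported errors.
    """
    result = subprocess.run(
        build_decode_check_command(source),
        capture_output=True,
        text=True,
    )
    return [line for line in result.stderr.splitlines() if line.strip()]
```

Calling `list_decode_errors("movie.mp4")` on a suspect file should return the error lines, if any. A clean result doesn't prove the sync is fine, it only rules out the kind of damage ffmpeg reports.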

Happy subbing
 
It's odd, I see that my extracted audio file is shorter in length than the movie mp4 file. I must be losing time in the audio extraction. I've tried extracting audio with VLC (to mp3) and audio extractor (to aac) with the same issue. It's happened on many files, so I don't think it's a corrupt file issue. I wonder if my system is introducing this discrepancy due to strain on the system during extraction?
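The length difference described above can be measured rather than eyeballed. Here is a minimal Python sketch that asks ffprobe for the duration of each file and reports the gap. The function names and the 0.5 second tolerance are assumptions picked for illustration, and it assumes ffprobe (it ships with ffmpeg) is on the PATH.

```python
import subprocess


def probe_duration(path):
    """Return the container duration of `path` in seconds, as reported by ffprobe."""
    cmd = [
        "ffprobe",
        "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return float(out.strip())


def drift_seconds(video_duration, audio_duration):
    """Positive result: the extracted audio is shorter than the video."""
    return video_duration - audio_duration


def is_out_of_sync(video_duration, audio_duration, tolerance=0.5):
    """True if the two durations differ by more than `tolerance` seconds."""
    return abs(drift_seconds(video_duration, audio_duration)) > tolerance
```

For example, `drift_seconds(probe_duration("movie.mp4"), probe_duration("audio.mp3"))` gives the number of seconds lost in extraction. A gap of several seconds on a full-length movie would match subtitles that slowly drift out of sync.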
 
... I wonder if my system is introducing this discrepancy due to strain on the system during extraction?

The most straightforward workaround is to download a different source. You don't need an HD source; even a 480p source will give you decent audio for transcription.
 
Sounds like you're re-encoding the audio to another format instead of actually demuxing (extracting) it, and potential loss of sync is a reason to avoid doing that.

Slightly edited quote of mine on what to use to achieve that:
For mkv you'd use mkvtoolnix (the easiest too, since it accepts most input files and you just extract as mka, which is a version of mkv for audio only), mp4box with some GUI for mp4 (I don't know of a recent one though), asfbinwin for wmv and avidemux for avi (which should handle mp4 and wmv too, maybe). But there are other options too.

Edit: Just to clarify something I forgot to say: demuxing wouldn't fix a delay caused by a corrupt audio source, but if it's slowly getting worse over time, it's less likely to be caused by corruption and more likely by a poor decoding filter used for re-encoding, or by PC issues.
Whisper's timing is pretty bad to begin with, so it could also be at fault, but it shouldn't be worse than before.
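For anyone who prefers the command line over the GUI tools listed above, ffmpeg can also demux with stream copy. To be clear, this is a swapped-in alternative and not one of the tools named in the post. The Python sketch below only builds the command; the function name and file names are hypothetical, and it assumes the first audio track is the one you want.

```python
def build_demux_command(source, destination):
    """Build an ffmpeg command that copies the first audio stream of
    `source` into `destination` without re-encoding it."""
    return [
        "ffmpeg",
        "-i", source,
        "-vn",             # drop video
        "-map", "0:a:0",   # keep only the first audio stream
        "-c:a", "copy",    # stream copy: no decode/encode step
        destination,
    ]


# An .mka (Matroska audio) container accepts most audio codecs, so the
# copied stream usually fits without any conversion.
command = build_demux_command("movie.mp4", "movie_audio.mka")
print(" ".join(command))
```

Since nothing gets decoded or re-encoded here, the copied audio should keep the same timing as the track inside the original file.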
 
The colab seems to always re-encode the audio. Is there some format (maybe FFMPEG command line arguments?) where it will skip that step? Is there some kind of pre-processing (volume, compression etc.) that can be done in FFMPEG that gives better results, or is the original file always the best?
 
I can't say about the colab version since I haven't looked into its code very much, but I know it's splitting the audio in many parts so it has to modify it.

With that said, whisper itself internally converts the audio to 16 kHz mono wav, so no matter what, it'll end up as that. Since wav is a lossless format, once it reaches that step it no longer matters if it gets re-encoded, since there will be no further quality loss.
Here's the code portion that does it. The "sr" value for ar is a variable set to 16000.
Code:
# This launches a subprocess to decode audio while down-mixing and resampling as necessary.
# Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
out, _ = (
    ffmpeg.input(file, threads=0)
    .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
    .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
)

So I would assume the VAD part of the colab is also converting to that since something else wouldn't make sense.

If you have to modify the audio somehow (boost the volume, for example), you should save it directly to that format to avoid any loss in quality. But if you're leaving it unmodified, there should be no reason to pre-convert it before uploading: it would likely take more space (256kbps, and most video rips on here are equal to or lower than that) and the quality would be identical.
With ffmpeg, it would look like this:
Code:
"ffmpeg.exe" -i "input.mp4" -vn -ac 1 -ar 16000 "output.wav"
But you need the full path for anything not in the current folder from the command window.
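The 256kbps figure mentioned above follows directly from the wav parameters: 16000 samples per second, 16 bits per sample, 1 channel. A small Python sketch of the arithmetic, with made-up function names, shows roughly how big the pre-converted file gets (ignoring the small wav header):

```python
def pcm_bitrate_bps(sample_rate=16000, bits_per_sample=16, channels=1):
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate * bits_per_sample * channels


def pcm_size_bytes(duration_seconds, sample_rate=16000, bits_per_sample=16, channels=1):
    """Approximate size of the raw PCM data in bytes (wav header not included)."""
    return pcm_bitrate_bps(sample_rate, bits_per_sample, channels) * duration_seconds // 8


print(pcm_bitrate_bps())      # 256000 bits per second, i.e. 256kbps
print(pcm_size_bytes(7200))   # 230400000 bytes for a 2 hour movie
```

So a 2 hour movie comes out at about 230 MB as 16 kHz mono wav, which is why uploading the original audio track is usually the smaller option.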

I barely use whisper so I haven't tested this a lot but I know audio formats and encoding well enough to say that unless either whisper, VAD or the colab combination of the 2 does something I'm not aware of(would need to be something stupid so I doubt it), what I said above is true for sure.
 
I get the following error when I'm using whisper in CMD:
NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/sta...bject-mode-fall-back-behaviour-when-using-jit for details.
def backtrace(trace: np.ndarray):


It gives me that error, but then it translates the audio file successfully. Should I ignore it?