I can't seek when playing that video, so it's a sign that it is corrupted somewhere and is probably the source of your issue with it. Oops, I apparently messed up copying the file. (mpc-hc wouldn't load the video properly for some reason, probably because some of the many audio files in the same folder were interfering with it.)
I am still doing some tests with the colab, but whisper is so inconsistent that it's hard to tell. I also downloaded the 1080p version, and that one isn't broken, so if you want to test with that audio yourself, here it is:
mega.nz
Edit: A comparison of your audio on the left (I get an identical file if I extract it myself with avidemux), the audio in an mka container extracted with mkvtoolnix in the middle, and the audio extracted by mp4box (different hash than avidemux) on the right.
The red bar on the right marks a line that differs between the files, which is basically always. A line is green, blue or purple if an identical line is found in at least 2 of the files.
The one on the left is slightly out of sync (less than a sec) compared to the other 2 if we look at the "Can I lick you?" line in all 3 (it's lower in the middle). I assume you meant a bigger delay than that, and I don't even know if the other 2 match the video; they could be worse.
View attachment 3324032
Edit2: 1080p "Can I lick you?" line from avidemux:
Code:
856
01:42:22,634 --> 01:42:24,634
Can I lick you?
It was slightly earlier than the left one at the very beginning.
Edit3: And from 1080p mp4box(translation for it changed):
Code:
861
01:42:23,650 --> 01:42:25,650
Can I lick your penis?
It does seem way off (30+ secs) if I watch the vid, but it's hard to tell. Gonna try to run it on default whisper on a test colab I just made.
Edit4: Default whisper 1080p mp4box:
Code:
1096
01:42:23,500 --> 01:42:24,500
Can I lick you?
So yeah, it's pretty much always in the same 1 sec range or so, but whisper being whisper, it's rarely the same.
Doesn't seem like you're doing anything wrong, looks like it's a whisper issue to me, but I'll try to convert the audio and see if anything changes.
I also attached all the srt files from the tests so far.
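If you want to check the offsets yourself instead of eyeballing the diff, here is a rough Python sketch that pulls the start time of a given line out of several srt files and prints how far each one is from a baseline. The cue data is copied from Edit2 to Edit4 above; the function names (`find_cues`, `to_seconds`) and the dictionary labels are just made up for this example, and the parser assumes plain, well-formed srt blocks.

```python
import re

# Matches an srt timing line such as "01:42:22,634 --> 01:42:24,634".
TIMING = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)


def to_seconds(h, m, s, ms):
    """Convert the four captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def find_cues(srt_text, phrase):
    """Return (start, end, text) for every cue whose text contains phrase."""
    hits = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                text = " ".join(lines[i + 1:]).strip()
                if phrase.lower() in text.lower():
                    g = match.groups()
                    hits.append((to_seconds(*g[:4]), to_seconds(*g[4:]), text))
                break  # only one timing line per cue
    return hits


# The three cues quoted in Edit2, Edit3 and Edit4 of this post.
samples = {
    "1080p avidemux": "856\n01:42:22,634 --> 01:42:24,634\nCan I lick you?\n",
    "1080p mp4box": "861\n01:42:23,650 --> 01:42:25,650\nCan I lick your penis?\n",
    "default whisper 1080p mp4box": "1096\n01:42:23,500 --> 01:42:24,500\nCan I lick you?\n",
}

# Start time of the first matching cue in each file.
starts = {name: find_cues(text, "Can I lick")[0][0] for name, text in samples.items()}
baseline = starts["1080p avidemux"]

for name, start in starts.items():
    print(f"{name}: start {start:.3f}s, offset {start - baseline:+.3f}s")
```

To run it on the real files, replace the strings in `samples` with the contents of the srt files (for example via `open(path, encoding="utf-8").read()`). Searching for a shared prefix of the line is deliberate, since whisper doesn't always translate it the same way.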
Final Edit: The ffmpeg mono wav (the format whisper uses internally) had a very similar result to the first mp4box one (exact same for that one line I keep comparing), with the usual variance in whisper results. So yeah, I'm out of ideas to test; nothing is wrong with the audio as far as I can tell.
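For anyone who wants to repeat the wav conversion, this is roughly how it can be scripted from Python. It's only a sketch: the 16 kHz sample rate and 16-bit PCM codec are my assumption of what "the format whisper uses internally" means (that is what the openai-whisper audio loader decodes to, as far as I know), the file names are placeholders, and ffmpeg has to be installed and on the PATH for `convert` to work.

```python
import shlex
import subprocess


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Build an ffmpeg command that writes a mono 16-bit PCM wav."""
    return [
        "ffmpeg",
        "-nostdin",               # don't wait for keyboard input
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input video or audio file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample (16 kHz assumed here)
        "-c:a", "pcm_s16le",      # uncompressed 16-bit little-endian PCM
        dst,
    ]


def convert(src, dst, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)


# Placeholder file names; print the command so it can be checked before running.
cmd = build_ffmpeg_cmd("input_1080p.mp4", "audio_mono.wav")
print(shlex.join(cmd))
```

Call `convert("input_1080p.mp4", "audio_mono.wav")` with your own paths to actually produce the file, then feed the wav to whisper the same way as the other extractions.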