Have you tried making a mono file the same way Whisper does, with ffmpeg, to test it? Maybe whatever you're using doesn't follow the same process and produces a different result. Or if you also skip the 16 kHz resample and leave the file at its original sample rate, that could influence things.

I've tried both stereo and mono and from what I can tell, I always get better results starting with a stereo file. It makes no difference if Whisper converts that stereo track into a mono track. My tests point to better results if the original track is in stereo. And for me there's pretty much no difference between two passes on the same file. I might get a couple more commas or full stops, and one or two words in a few sentences will be different, but on the whole, I get the same content. If I change the Threshold, though, there are always clear differences. So, for me, I'll stick with my stereo uploads.

Just curious because it makes no sense that it would be any different unless the resulting audio is altered in some way.
Edit: With the above, I assumed you used WAV when making them mono. If that's not the case, that's likely the problem right there: you're doing an extra lossy encoding pass, so you're losing more detail than if you hadn't done anything.
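
For anyone who wants to run that test, here is a rough sketch in Python, assuming ffmpeg is installed and on your PATH. The flags (`-ac 1`, `-ar 16000`, `pcm_s16le`) are the ones Whisper's own audio loader passes to ffmpeg as far as I know, except that this writes a lossless WAV file instead of piping raw samples. The function names and file names are made up for illustration.

```python
import subprocess


def build_mono_16k_command(src, dst):
    """Build an ffmpeg command that downmixes to mono, resamples to 16 kHz,
    and encodes as 16-bit PCM (lossless, so no extra lossy encoding pass)."""
    return [
        "ffmpeg",
        "-nostdin",              # don't wait for keyboard input
        "-y",                    # overwrite dst if it already exists
        "-i", src,               # input file (stereo, any format ffmpeg reads)
        "-ac", "1",              # one audio channel (mono)
        "-ar", "16000",          # 16 kHz sample rate
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        dst,                     # output path; use a .wav extension
    ]


def convert_to_mono_16k(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mono_16k_command(src, dst), check=True)
```

Calling `convert_to_mono_16k("input_stereo.mp3", "input_mono_16k.wav")` and then transcribing both files with the same settings should show whether the mono file itself is the problem or the way it was encoded.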