That's not going to do anything; it'll just get ignored and the original sampling rate will be kept, which for the vast majority of audio is 48k or 44.1k (I don't think I've seen anything other than those two in a movie or music/audio file unless it's a weird file).
It's almost certainly not going to be 16k from any audio source, so it wouldn't make sense for that to be what's needed, unless whisper is converting to that internally.
I haven't done any research on this, don't really use whisper, and haven't looked at its code, so I have no clue about this, but it would be a very odd choice to require 16k on the source for best results.
Edit: So, I checked the code and, as I suspected, it resamples to 16k internally while decoding the audio, so the source sample rate doesn't matter.
Here's the portion of the code that does it. The "sr" value passed as ar (ffmpeg's output sample rate) is a variable set to 16000, and ac=1 down-mixes to mono.
Code:
# This launches a subprocess to decode audio while down-mixing and resampling as necessary.
# Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
out, _ = (
    ffmpeg.input(file, threads=0)
    .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
    .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
)
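For anyone who doesn't use ffmpeg-python, that call boils down to roughly the following plain ffmpeg command line. This is only a sketch: the function names `build_ffmpeg_cmd` and `decode_to_pcm` are made up for illustration and are not part of whisper, and the exact argument order ffmpeg-python generates may differ.

```python
import subprocess

SAMPLE_RATE = 16000  # assumption: mirrors the "sr" value mentioned above


def build_ffmpeg_cmd(path, sr=SAMPLE_RATE):
    """Build an ffmpeg command that decodes `path` to raw 16-bit mono PCM on stdout.

    Hypothetical helper, not whisper's own code.
    """
    return [
        "ffmpeg", "-nostdin",
        "-threads", "0",
        "-i", path,
        "-f", "s16le",           # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",
        "-ac", "1",              # down-mix to mono
        "-ar", str(sr),          # resample to sr Hz
        "-",                     # write to stdout
    ]


def decode_to_pcm(path, sr=SAMPLE_RATE):
    """Run ffmpeg and return the raw PCM bytes (needs the ffmpeg CLI installed)."""
    result = subprocess.run(build_ffmpeg_cmd(path, sr), capture_output=True, check=True)
    return result.stdout
```

Whatever sample rate the source has, the bytes that come back are already 16k mono, which is why resampling beforehand doesn't buy anything.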
Edit2: Here are the sizes of the output files for the test movie I used, to give you an idea of why resampling it manually is just wasting space most of the time:
Original AAC (192kbps): 249.35MB
Uncompressed wav: 1.92GB
Mei2/quay2 16k (stereo): 655.96MB
Whisper 16k (mono): 327.98MB
So unless you start with a lightly compressed audio file that's above 256kbps (16k mono 16-bit PCM, which is what the whisper processing produces, works out to 256kbps), you're just ending up with a bigger file. You also want to avoid an extra lossy compression step so you don't degrade the audio quality, which means you have to use something like wav for the output if you process the original.
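The 256kbps figure and the file sizes above are just sample rate × bit depth × channels, which is easy to check. A quick sketch, with two assumptions on my part: the runtime isn't stated anywhere, so the roughly 10747 seconds used below is back-calculated from the 327.98MB figure, and "MB" is taken to mean MiB (1024 × 1024 bytes). The function names are only illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Bitrate of uncompressed PCM in kbps (1 kbps = 1000 bits per second)."""
    return sample_rate_hz * bit_depth * channels / 1000


def pcm_size_mib(duration_s, sample_rate_hz, bit_depth=16, channels=1):
    """Size of uncompressed PCM audio in MiB, ignoring any container header."""
    total_bytes = duration_s * sample_rate_hz * (bit_depth // 8) * channels
    return total_bytes / (1024 * 1024)


# Assumed runtime in seconds, inferred from the sizes in the post (not stated there).
DURATION_S = 10747

print(pcm_bitrate_kbps(16000, 16, 1))                 # whisper's 16k mono stream
print(round(pcm_size_mib(DURATION_S, 16000), 2))      # 16k mono size in MiB
print(round(pcm_size_mib(DURATION_S, 16000, channels=2), 2))  # 16k stereo size in MiB
```

With that assumed runtime the mono and stereo numbers land within a few hundredths of a MB of the sizes listed, and anything encoded below 256kbps (like the 192kbps AAC) will always be smaller than what whisper generates on its own.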