A big hurdle towards fine-tuning Whisper (or any other model) is a lack of Japanese training data. OpenAI's dataset has 15,914 hours of JP audio (7054 hours with Japanese transcripts, 8860 with English ones), and even that dwarfs the publicly available ones I'm aware of.
Is anyone interested in...