Were you using the big Japanese model that can be used with Subtitle Edit? I tried that one a week ago and it didn't seem much better than the small one (which was awful).
Exactly!!! I used the new release of the big model (I'm sure you used the new release as well). I got slightly better results with the big model, something like 25% more lines. I ran the big model on 5 different versions of the audio, playing with volume, bitrate, and speed. Not much difference. The slowed-down version was the worst of them all.
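In case anyone wants to repeat that kind of comparison, here is a rough sketch of how the audio variants could be generated with ffmpeg from Python. The variant names, the filter values (1.5x volume, 64k bitrate, 0.8x and 1.25x tempo), and the `.mp3` output naming are my own assumptions for illustration, not the exact settings from the runs above. The function only builds the command lines; running them requires ffmpeg on your PATH.

```python
from typing import Dict, List

# Hypothetical variant settings; the values actually used in the thread are not stated.
VARIANTS: Dict[str, List[str]] = {
    "original": [],
    "louder": ["-filter:a", "volume=1.5"],    # ffmpeg "volume" audio filter
    "low_bitrate": ["-b:a", "64k"],           # audio bitrate option
    "slow": ["-filter:a", "atempo=0.8"],      # ffmpeg "atempo" audio filter
    "fast": ["-filter:a", "atempo=1.25"],
}


def build_ffmpeg_command(src: str, name: str, extra_args: List[str]) -> List[str]:
    """Return an ffmpeg argument list that writes one audio-only variant."""
    # -y overwrites an existing output file, -vn drops any video stream.
    return ["ffmpeg", "-y", "-i", src, "-vn", *extra_args, f"{name}.mp3"]


def build_all_commands(src: str) -> List[List[str]]:
    """Build one command per variant defined in VARIANTS."""
    return [build_ffmpeg_command(src, name, args) for name, args in VARIANTS.items()]


if __name__ == "__main__":
    # "episode.mp4" is a placeholder input file name.
    for cmd in build_all_commands("episode.mp4"):
        print(" ".join(cmd))
```

Each printed command can be run by hand or passed to `subprocess.run(cmd, check=True)`, and the resulting files fed to the model one at a time to compare how many lines come out.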
I have been looking into whether one can fine-tune the model through Kaldi. I haven't done model training before, so it is a steep learning curve for me. It seems that almost all the models are trained for a single speaker in "quiet" acoustics. Let me know if you come across any good workarounds.
Side note: have you tried the Google Vision APIs for transcription? I haven't done any research on that, but I was wondering if they use a technique similar to YouTube's. Presumably more promising than Google Cloud Speech.