Hi
@Scrapper, have you had a chance to evaluate the model? I looked at it back around the time it was released. I didn't evaluate it thoroughly and just ran a few quick tests. I likely dismissed it too quickly at the time, because I didn't think large-v3 was generally a suitable model for JAV. The test I did was with their version, which used stable-ts as the backend.
Later on I came across another model, built on top of Kotoba, that I liked a lot:
anime-whisper. This one was trained for anime, including NSFW content. I really like its style of transcription: you can see in their evaluation table how it renders spoken disfluencies, which to me makes it feel so anime-like. The big problem for me is that I can't get timestamps out of the model. I really hope some clever person can solve that. At the moment I use this model as my third pass and compare its output for the difficult sections against passes one and two.
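In case it helps anyone, here is a rough sketch of how that three-way comparison could be automated. It is not my actual workflow, just an illustration under some assumptions: the segments have already been cut using the timestamps from the first two runs, each clip was then transcribed separately by anime-whisper, and all three lists line up one-to-one. The function names and the 0.6 threshold are made up for the example and would need tuning.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Character-level matching is used because Japanese text has no
    spaces between words, so no tokenizer is needed.
    """
    return SequenceMatcher(None, a, b).ratio()


def flag_disagreements(run_one, run_two, run_three, threshold=0.6):
    """Return indices of segments where the third run differs from both others.

    All three arguments are lists of strings, one string per segment,
    assumed to be aligned by index. The threshold is an arbitrary
    starting point, not a recommended value.
    """
    flagged = []
    for i, (t1, t2, t3) in enumerate(zip(run_one, run_two, run_three)):
        if similarity(t1, t3) < threshold and similarity(t2, t3) < threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    run_one = ["こんにちは", "ありがとう"]
    run_two = ["こんにちは", "ありがと"]
    run_three = ["こんにちは", "えっと、その、だめ"]
    # Prints the indices of segments worth reviewing by hand.
    print(flag_disagreements(run_one, run_two, run_three))
```

The flagged indices are only a shortlist for manual review; a low score can simply mean the third model wrote out disfluencies that the others skipped.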