I've been impressed by Whisper. I'm running it locally; my machine can handle the Medium model but not the Large. Various thoughts on it and how it relates to JAV below. It's probably worth making a separate thread for discussing optimization, since there are a lot of parameters.
- Whisper is best at picking up on plot segments where characters are calmly talking in simple sentence structures.
- These "large language models" seemingly avoid edgy material, so conversations about slavery and r***, or super explicit sex talk, will end up with euphemisms at best, but are more likely to just come off as feeling totally wrong. I don't know what parameters would allow it to pick up on edgier language, though I have some ideas.
- Unless you are using an additional VAD, you probably need to spend time figuring out good settings for no_speech_threshold and logprob_threshold. no_speech_threshold=0.3 and logprob_threshold=0.1 seem like good starting points that don't pick up too much fake noise or split a single line into multiple fragments, which are the two common problems with detecting voices. There's much more nuance than that, since the two settings interact with each other and depend on the sound quality of the video.
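To illustrate how the two thresholds interact, here's a small sketch of the per-segment skip rule, re-implemented from the openai-whisper source (the function name is mine, not part of the library; defaults are the starting points suggested above):

```python
def keep_segment(no_speech_prob, avg_logprob,
                 no_speech_threshold=0.3, logprob_threshold=0.1):
    """Mirror of openai-whisper's skip rule: a segment is dropped only
    when the model thinks it is probably silence AND the decode was
    low-confidence; a confident decode overrides the silence guess."""
    probably_silence = no_speech_prob > no_speech_threshold
    confident_decode = avg_logprob > logprob_threshold
    return (not probably_silence) or confident_decode

# A segment flagged as likely silence is still kept if the decode was confident:
print(keep_segment(no_speech_prob=0.5, avg_logprob=0.2))   # True
# ...but gets dropped when the decode was also low-confidence:
print(keep_segment(no_speech_prob=0.5, avg_logprob=-1.5))  # False
```

This is also why the two are linked: raising logprob_threshold effectively makes the no_speech check stricter, because fewer segments are confident enough to override it.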
- If you have a GPU, you want to make sure you are actually using it. This requires setting up CUDA, which can take some installing and uninstalling of PyTorch to make sure you have the correct versions.
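The usual sequence for that looks something like this (a sketch, assuming an NVIDIA GPU and the CUDA 11.8 wheels; adjust the index URL to match your CUDA version):

```shell
# Remove any CPU-only build first, then install a CUDA-enabled PyTorch wheel
pip uninstall -y torch
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Sanity check: should print True if Whisper will be able to use the GPU
python -c "import torch; print(torch.cuda.is_available())"
```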
- When you are doing an initial test, you likely want to use the Tiny model, which is the fastest and the only one that can realistically run on a CPU.
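A minimal first run with the openai-whisper CLI might look like this (`clip.mp4` is a placeholder for whatever short test file you're using):

```shell
# Tiny model, forced Japanese, translate-to-English task
whisper clip.mp4 --model tiny --language ja --task translate
```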
- Most modern gaming PCs can likely handle the Medium model, but the Large model is demanding enough that you either need to use the Colab notebooks that others post or have a $3500 gaming PC.
- There is huge value to using the larger models, especially when it comes to consistently discerning character names and avoiding the issues with repetitive lines.
- "temperature", "beam_size", and "best_of" are options that might take some tinkering before an optimal setup for JAV is discovered.
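One thing worth knowing before tinkering: in openai-whisper, beam_size only applies when temperature is 0 (beam search), while best_of only applies when sampling at temperature > 0 (which happens during the fallback retries), so a sweep has to vary them together. A hypothetical starting invocation:

```shell
whisper movie.mp4 --model medium --language ja --task translate \
    --temperature 0 --beam_size 5 --best_of 5
```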
- There are models that are a bit more tailored to Japanese, such as https://huggingface.co/vumichien/whisper-medium-jp. I do not know how to get it to work after downloading the .bin file, though. None of the guides are written from the perspective of somebody who is new to using these HuggingFace models.
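For what it's worth, checkpoints hosted on HuggingFace in that format are normally loaded through the `transformers` library rather than by pointing openai-whisper at the .bin file. A sketch (the pipeline downloads the weights itself, so the manual download isn't needed; `audio.wav` is a placeholder):

```python
from transformers import pipeline

# The "automatic-speech-recognition" pipeline wraps the fine-tuned Whisper checkpoint
asr = pipeline("automatic-speech-recognition",
               model="vumichien/whisper-medium-jp",
               chunk_length_s=30)  # process long audio in 30-second windows

result = asr("audio.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```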
I have some security questions. When I upload files here, what happens to the metadata that would identify my computer name? If I use the online Colab service, can other people link what I translate to my Google account?
Here's an example of a pretty good transcript that I've been able to generate. There's a good amount of "mean" dialogue that fits the Attackers script.
[01:00:00.560 --> 01:00:02.940] You...
[01:00:03.740 --> 01:00:06.600] Stop it.
[01:00:15.520 --> 01:00:17.440] My heart can't handle it.
[01:00:17.480 --> 01:00:18.660] No...
[01:00:20.000 --> 01:00:22.440] What's gotten into you?
[01:00:22.900 --> 01:00:24.600] Come on.
[01:00:25.560 --> 01:00:27.020] Do you hear?
[01:01:27.020 --> 01:01:29.860] You think you'll have fun?
[01:01:29.860 --> 01:01:31.820] Come here.
[01:01:44.340 --> 01:01:47.300] Good girl...
[01:01:52.240 --> 01:01:54.040] Take a kiss...
[01:01:54.040 --> 01:01:56.580] You can call it a blessing.
[01:01:57.580 --> 01:01:58.780] No.
[01:01:58.780 --> 01:02:00.620] Hold yourself down.
[01:02:00.620 --> 01:02:04.240] Give it to her.
[01:02:04.240 --> 01:02:05.280] Let her go.
[01:02:05.280 --> 01:02:07.000] No.
[01:02:09.020 --> 01:02:10.500] I'm sorry.
[01:02:10.500 --> 01:02:20.540] Do whatever you want, you half-hearted brat.
[01:02:20.540 --> 01:02:25.580] I won't.
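The bracketed timestamps above are just Whisper's per-segment start/end seconds rendered as HH:MM:SS.mmm; a small helper like this (my own, not part of Whisper) reproduces the format from the raw segments:

```python
def fmt_timestamp(seconds: float) -> str:
    # Render a segment time in the HH:MM:SS.mmm style shown above
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def fmt_segment(seg: dict) -> str:
    # seg is one entry of result["segments"] from model.transcribe()
    return f"[{fmt_timestamp(seg['start'])} --> {fmt_timestamp(seg['end'])}] {seg['text']}"

print(fmt_segment({"start": 3600.56, "end": 3602.94, "text": "You..."}))
# [01:00:00.560 --> 01:00:02.940] You...
```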