Commit 96ea5411 authored by Sanyam Bhutani

Added all notes

parent 5b556930
@@ -6,45 +6,91 @@ The goal was to use the models that were easy to set up and sounded less robotic
#### Parler-TTS
Minimal code to run their models:
These were the most consistent description prompts in testing.

Small:
```
# Imports and device setup
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = "This is where the actual words to be spoken go"
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
```
Large:
```
# Load the large checkpoint the same way as above, e.g. "parler-tts/parler-tts-large-v1"
import IPython.display as ipd

description = """
Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
"""

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

ipd.Audio(audio_arr, rate=model.config.sampling_rate)
```
The really cool aspect of these models is the ability to prompt the `description`, which can change the speaker profile and pacing of the outputs.
Surprisingly, Parler's mini model sounded more natural.
In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share names of speakers that we can use in the prompt.
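As a quick illustration, here is a rough sketch of prompting different profiles and pacing (the two descriptions below are made-up examples; it reuses `model`, `tokenizer`, `text_prompt`, and `device` from the snippets above):
```
# Made-up example descriptions to contrast speaker profile and pacing
descriptions = {
    "fast": "Laura's voice is expressive and dramatic, speaking at a very fast pace with a very close recording that almost has no background noise.",
    "slow": "Gary's voice is calm and monotone, speaking at a slow pace with a very close recording that almost has no background noise.",
}

for name, description in descriptions.items():
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    ipd.display(ipd.Audio(audio_arr, rate=model.config.sampling_rate))
```
Listening to the outputs side by side is the quickest way to judge whether a description change actually sticks.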
#### Suno/Bark
Minimal code to run bark:
```
# Load the Bark model and processor from transformers
from transformers import AutoProcessor, BarkModel
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)

voice_preset = "v2/en_speaker_6"
sampling_rate = 24000

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```
Similar to the Parler models, Suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers.

Bark is cool, but only v6 works great; v9 from their library sounded quite robotic, which is sad. So we use Parler for our first speaker and the best one from Bark (v6) for the second.
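To make the v6 vs v9 comparison concrete, here is a small sketch that renders the same made-up line with both presets and saves a file per preset for an A/B listen (`processor`, `model`, `device`, and `sampling_rate` as in the snippet above):
```
import scipy.io.wavfile as wavfile

test_line = "Welcome back to the show, today we are talking about model distillation."

# Render the same line with two presets and save a wav per preset for a quick A/B listen
for preset in ["v2/en_speaker_6", "v2/en_speaker_9"]:
    inputs = processor(test_line, voice_preset=preset).to(device)
    speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    audio = speech_output[0].cpu().numpy()
    wavfile.write(f"bark_{preset.split('/')[-1]}.wav", sampling_rate, audio)
```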
The incredible thing about the Bark model is the ability to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`. Making words capital causes the model to emphasise them.
Adding `-` gives a break in the text. We utilise this knowledge when we re-write the transcript with the 8B model to add these effects.
Note: the authors suggest using `...`. However, this didn't work as effectively as adding a hyphen during trials.
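For example, a rough sketch of what a re-written line can look like once the effects are baked in (the line itself is made up; `processor`, `model`, `device`, and `sampling_rate` as above):
```
# Made-up example line: [sigh] for an effect, CAPS for emphasis, hyphens for short breaks
effects_line = """
Right! [sigh] So the TEACHER model stays big - and the student model - learns to copy its behaviour.
"""
inputs = processor(effects_line, voice_preset="v2/en_speaker_6").to(device)
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```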
Tested sound effects:
- Laugh is probably the most effective
- Sigh is hit or miss
- Gasps doesn't work
- A single hyphen is effective
- Capitalisation makes it louder

#### Hyper-parameters:

Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`.

Below are the notes from a sweep: the prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. The values below are `temperature` and `semantic_temperature` respectively (a sketch of the sweep loop follows the results):
First, fix `temperature` at `0.7` and sweep `semantic_temperature`:
- `0.7`, `0.2`: Quite bland and boring
- `0.7`, `0.3`: An improvement over the previous one
- `0.7`, `0.4`: Further improvement
- `0.7`, `0.5`: This one didn't work
- `0.7`, `0.6`: So-So, didn't stand out
- `0.7`, `0.7`: The best so far
- `0.7`, `0.8`: Further improvement
- `0.7`, `0.9`: Mixed feelings on this one
Now fix `semantic_temperature` at `0.9` and sweep `temperature`:
- `0.1`, `0.9`: Very Robotic
- `0.2`, `0.9`: Less Robotic but not convincing
- `0.3`, `0.9`: Slight improvement still not fun
- `0.4`, `0.9`: Still has a robotic tinge
- `0.5`, `0.9`: The laugh was weird on this one, and the voice modulates so much it feels like the speaker is changing
- `0.6`, `0.9`: Most consistent voice but has a robotic after-taste
- `0.7`, `0.9`: Very robotic and laugh was weird
- `0.8`, `0.9`: Completely ignored the laughter, but it was more natural
- `0.9`, `0.9`: We have a winner, probably
After this, about 30 more sweeps were done with the promising combinations:
Best results are at:
```
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```
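A rough sketch of how such a sweep can be scripted; the grids and file naming here are illustrative, and the actual judging was a manual vibe check of each file:
```
import itertools
import scipy.io.wavfile as wavfile

# Fixed prompt and speaker (inputs as above); one wav per combination, then listen to each
for temp, sem_temp in itertools.product([0.7, 0.8, 0.9], [0.7, 0.8, 0.9]):
    speech_output = model.generate(**inputs, temperature=temp, semantic_temperature=sem_temp)
    audio = speech_output[0].cpu().numpy()
    wavfile.write(f"sweep_t{temp}_s{sem_temp}.wav", sampling_rate, audio)
```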
### Notes from other models that were tested:
@@ -57,8 +103,6 @@ Promising directions to explore in future:
- E2-TTS: r/LocalLLaMA claims this to be a little better; however, it didn't pass the vibe test
- [xTTS](https://coqui.ai/blog/tts/open_xtts): It has great documentation and also seems promising
#### Some more models that weren't tested:
In other words, we leave this as an exercise for readers :D