Commit 96ea5411 authored by Sanyam Bhutani

Added all notes

parent 5b556930
@@ -6,45 +6,91 @@ The goal was to use the models that were easy to set up and sounded less robotic
#### Parler-TTS
Minimal code to run their models:
These were the most consistent description prompts in testing.

Small:
```
# Imports and device setup
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = "This is where the actual words to be spoken go"
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
```
Large:
```
# Load the large checkpoint the same way as above, e.g. "parler-tts/parler-tts-large-v1"
import IPython.display as ipd

description = """
Alisa's voice is consistent, quite expressive and dramatic in delivery, with a very close recording that almost has no background noise.
"""

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

ipd.Audio(audio_arr, rate=model.config.sampling_rate)
```
The really cool aspect of these models is the ability to prompt the `description`, which can change the speaker profile and pacing of the outputs.
Surprisingly, Parler's mini model sounded more natural.
In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share names of speakers that we can use in the prompt.
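As a quick illustration, here is a rough sketch of prompting different profiles and pacing (the two descriptions below are made-up examples; it reuses `model`, `tokenizer`, `text_prompt`, and `device` from the snippets above):
```
# Made-up example descriptions to contrast speaker profile and pacing
descriptions = {
    "fast": "Laura's voice is expressive and dramatic, speaking at a very fast pace with a very close recording that almost has no background noise.",
    "slow": "Gary's voice is calm and monotone, speaking at a slow pace with a very close recording that almost has no background noise.",
}

for name, description in descriptions.items():
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    ipd.display(ipd.Audio(audio_arr, rate=model.config.sampling_rate))
```
Listening to the outputs side by side is the quickest way to judge whether a description change actually sticks.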
#### Suno/Bark
Minimal code to run bark:
```
# Load the Bark model and processor from transformers
from transformers import AutoProcessor, BarkModel
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)

voice_preset = "v2/en_speaker_6"
sampling_rate = 24000

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```
Similar to the Parler models, Suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers.

Bark is cool, but only v6 works great; v9 from their library sounded quite robotic, which is sad. So we use Parler for our first speaker and the best one from Bark (v6) for the second.
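To make the v6 vs v9 comparison concrete, here is a small sketch that renders the same made-up line with both presets and saves a file per preset for an A/B listen (`processor`, `model`, `device`, and `sampling_rate` as in the snippet above):
```
import scipy.io.wavfile as wavfile

test_line = "Welcome back to the show, today we are talking about model distillation."

# Render the same line with two presets and save a wav per preset for a quick A/B listen
for preset in ["v2/en_speaker_6", "v2/en_speaker_9"]:
    inputs = processor(test_line, voice_preset=preset).to(device)
    speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    audio = speech_output[0].cpu().numpy()
    wavfile.write(f"bark_{preset.split('/')[-1]}.wav", sampling_rate, audio)
```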
The incredible thing about the Bark model is the ability to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`. Making words capital causes the model to emphasise them.
Adding `-` gives a break in the text. We utilise this knowledge when we re-write the transcript with the 8B model to add these effects.
Note: the authors suggest using `...`. However, this didn't work as effectively as adding a hyphen during trials.
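For example, a rough sketch of what a re-written line can look like once the effects are baked in (the line itself is made up; `processor`, `model`, `device`, and `sampling_rate` as above):
```
# Made-up example line: [sigh] for an effect, CAPS for emphasis, hyphens for short breaks
effects_line = """
Right! [sigh] So the TEACHER model stays big - and the student model - learns to copy its behaviour.
"""
inputs = processor(effects_line, voice_preset="v2/en_speaker_6").to(device)
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```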
Tested sound effects:
- Laugh is probably the most effective
- Sigh is hit or miss
- Gasps doesn't work
- A single hyphen is effective
- Capitalisation makes it louder

#### Hyper-parameters:

Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`.

Below are the notes from a sweep: the prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. The values below are `temperature` and `semantic_temperature` respectively (a sketch of the sweep loop follows the results):
First, fix `temperature` at `0.7` and sweep `semantic_temperature`:
- `0.7`, `0.2`: Quite bland and boring
- `0.7`, `0.3`: An improvement over the previous one
- `0.7`, `0.4`: Further improvement
- `0.7`, `0.5`: This one didn't work
- `0.7`, `0.6`: So-So, didn't stand out
- `0.7`, `0.7`: The best so far
- `0.7`, `0.8`: Further improvement
- `0.7`, `0.9`: Mixed feelings on this one
Now fix `semantic_temperature` at `0.9` and sweep `temperature`:
- `0.1`, `0.9`: Very Robotic
- `0.2`, `0.9`: Less Robotic but not convincing
- `0.3`, `0.9`: Slight improvement still not fun
- `0.4`, `0.9`: Still has a robotic tinge
- `0.5`, `0.9`: The laugh was weird on this one, and the voice modulates so much it feels like the speaker is changing
- `0.6`, `0.9`: Most consistent voice but has a robotic after-taste
- `0.7`, `0.9`: Very robotic and laugh was weird
- `0.8`, `0.9`: Completely ignored the laughter, but it was more natural
- `0.9`, `0.9`: We have a winner, probably
After this, about 30 more sweeps were done with the promising combinations:
Best results are at:
```
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```
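A rough sketch of how such a sweep can be scripted; the grids and file naming here are illustrative, and the actual judging was a manual vibe check of each file:
```
import itertools
import scipy.io.wavfile as wavfile

# Fixed prompt and speaker (inputs as above); one wav per combination, then listen to each
for temp, sem_temp in itertools.product([0.7, 0.8, 0.9], [0.7, 0.8, 0.9]):
    speech_output = model.generate(**inputs, temperature=temp, semantic_temperature=sem_temp)
    audio = speech_output[0].cpu().numpy()
    wavfile.write(f"sweep_t{temp}_s{sem_temp}.wav", sampling_rate, audio)
```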
### Notes from other models that were tested:
@@ -57,8 +103,6 @@ Promising directions to explore in future:
- E2-TTS: r/LocalLLaMA claims this to be a little better; however, it didn't pass the vibe test
- [xTTS](https://coqui.ai/blog/tts/open_xtts): It has great documentation and also seems promising
#### Some more models that weren't tested:
In other words, we leave this as an exercise for readers :D