The goal was to use the models that were easy to set up and sounded less robotic.

#### Parler-TTS

Minimal code to run their models:

```
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import IPython.display as ipd

device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = "This is where the actual words to be spoken go"
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

ipd.Audio(audio_arr, rate=model.config.sampling_rate)
```

The really cool aspect of these models is the ability to prompt the `description`, which can change the speaker profile and pacing of the outputs.

Surprisingly, Parler's mini model sounded more natural.

In their [repo](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency) they share the names of speakers that we can use in the prompt.

#### Suno/Bark

Minimal code to run Bark:

```
import torch
from transformers import AutoProcessor, BarkModel
from IPython.display import Audio

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device)

voice_preset = "v2/en_speaker_6"
sampling_rate = 24000

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```

Similar to the Parler models, Suno has a [library](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) of speakers.

Speaker v9 from their library sounded robotic, so we use Parler for our first speaker and the best one from Bark.

The incredible thing about the Bark model is being able to add sound effects: `[Laugh]`, `[Gasps]`, `[Sigh]`, `[clears throat]`. Making words capital causes the model to emphasise them.

Adding `-` gives a break in the text.
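
To hear what this markup actually does, here is a quick sketch that A/Bs the same sentence with and without the cues. The sentences are made up for illustration, and it reuses the `processor`, `model`, `device` and `sampling_rate` from the Bark snippet above:

```
from IPython.display import Audio, display

# Same sentence, with and without the markup described above
# (sound-effect tag, CAPITALISED emphasis, hyphens for breaks).
plain = "So the takeaway here is that smaller models can sound surprisingly natural."
marked_up = "[sigh] So the TAKEAWAY here is - quite simply - that smaller models can sound surprisingly natural."

for prompt in (plain, marked_up):
    # processor, model, device and sampling_rate are assumed from the Bark snippet above.
    inputs = processor(prompt, voice_preset="v2/en_speaker_6").to(device)
    speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    display(Audio(speech_output[0].cpu().numpy(), rate=sampling_rate))
```

Listening to the two clips back to back makes it easier to judge whether a given tag is actually being picked up by the model.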
We utilise this knowledge (the sound effects, capitalisation and hyphens) when we re-write the transcript with the 8B model to add these effects.

Note: The authors suggest using `...`; however, in our trials this didn't work as effectively as adding a hyphen.

#### Hyper-parameters:

Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`.

Below are the notes from a sweep: the prompt and speaker were fixed, and this was a vibe test to see which combination gives the best results. Each entry lists `temperature` and `semantic_temperature` respectively (a minimal sketch of such a sweep loop is included at the end of these notes).

First, fix `temperature` and sweep `semantic_temperature`:
- `0.7`, `0.2`: Quite bland and boring
- `0.7`, `0.3`: An improvement over the previous one
- `0.7`, `0.4`: Further improvement
- `0.7`, `0.5`: This one didn't work
- `0.7`, `0.6`: So-so, didn't stand out
- `0.7`, `0.7`: The best so far
- `0.7`, `0.8`: Further improvement
- `0.7`, `0.9`: Mixed feelings on this one

Now sweeping `temperature`:
- `0.1`, `0.9`: Very robotic
- `0.2`, `0.9`: Less robotic but not convincing
- `0.3`, `0.9`: Slight improvement, still not fun
- `0.4`, `0.9`: Still has a robotic tinge
- `0.5`, `0.9`: The laugh was weird on this one, and the voice modulates so much it feels like the speaker is changing
- `0.6`, `0.9`: Most consistent voice but has a robotic after-taste
- `0.7`, `0.9`: Very robotic and the laugh was weird
- `0.8`, `0.9`: Completely ignores the laughter but sounds more natural
- `0.9`, `0.9`: We have a winner, probably

After this, about 30 more sweeps were done with the promising combinations.

Best results are at:

```
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)
```

### Notes from other models that were tested:

Promising directions to explore in future:

- E2-TTS: r/LocalLLaMA claims this to be a little better; however, it didn't pass the vibe test
- [xTTS](https://coqui.ai/blog/tts/open_xtts): It has great documentation and also seems promising

#### Some more models that weren't tested:

In other words, we leave this as an exercise to readers :D
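
For reference, here is the minimal sketch of a sweep loop mentioned in the hyper-parameters section above. The grid values, prompt and output file names are illustrative assumptions, and it reuses the `processor`, `model`, `device` and `sampling_rate` from the Bark snippet:

```
import itertools

import scipy.io.wavfile

# Illustrative grid around the promising region from the notes above.
temperatures = [0.7, 0.8, 0.9]
semantic_temperatures = [0.7, 0.8, 0.9]

# Fixed prompt for the whole sweep (made up for illustration).
sweep_prompt = "Exactly! [sigh] And THAT is why the smaller model still sounds natural."

for temp, sem_temp in itertools.product(temperatures, semantic_temperatures):
    # processor, model, device and sampling_rate are assumed from the Bark snippet earlier.
    inputs = processor(sweep_prompt, voice_preset="v2/en_speaker_6").to(device)
    speech_output = model.generate(**inputs, temperature=temp, semantic_temperature=sem_temp)
    audio_arr = speech_output[0].cpu().numpy()
    # Write each combination to disk so the clips can be vibe-tested side by side.
    scipy.io.wavfile.write(f"sweep_t{temp}_s{sem_temp}.wav", sampling_rate, audio_arr)
```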