# Speech to Speech: An Open-Source and Modular GPT-4o Pipeline
## Approach

### Structure
This repository implements a speech-to-speech cascaded pipeline with consecutive parts:
1. **Voice Activity Detection (VAD)**: [Silero VAD v5](https://github.com/snakers4/silero-vad)
2. **Speech to Text (STT)**: Whisper models (including distilled versions)
3. **Language Model (LM)**: Any instruct model available on the [Hugging Face Hub](https://huggingface.co/models)! 🤗
4. **Text to Speech (TTS)**: [Parler-TTS](https://github.com/huggingface/parler-tts)
### Modularity
The pipeline aims to provide a fully open and modular approach, leveraging models available on the Hugging Face hub through the Transformers library. The level of modularity intended for each part is as follows:
- **VAD**: Uses the implementation from [Silero's repo](https://github.com/snakers4/silero-vad).
- **STT**: Uses Whisper models exclusively; however, any Whisper checkpoint can be used, enabling options like [Distil-Whisper](https://huggingface.co/distil-whisper/distil-large-v3) and [French Distil-Whisper](https://huggingface.co/eustlb/distil-large-v3-fr).
- **LM**: This part is fully modular and can be changed by simply modifying the Hugging Face hub model ID, as shown in the sketch below. Note that the chosen model must be an instruct model, since usage here amounts to chatting with it.
- **TTS**: Sticks with the Parler-TTS mini architecture, but different checkpoints, including fine-tuned multilingual checkpoints, are supported.

The code is designed to facilitate easy modification. Each component is implemented as a class and can be re-implemented to match specific needs.
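For instance, assuming the `stt`/`lm`/`tts` flag prefix convention documented under Command-line Usage below, swapping in the French Distil-Whisper checkpoint and a different instruct model could look like this (the model IDs are illustrative, not defaults):
```bash
# Illustrative sketch: flag names follow the prefix convention described in
# the Command-line Usage section; model IDs are examples, not defaults.
python s2s_pipeline.py \
  --stt_model_name eustlb/distil-large-v3-fr \
  --lm_model_name google/gemma-2b-it \
  --recv_host localhost --send_host localhost
```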
## Setup

Clone the repository:
```bash
git clone https://github.com/eustlb/speech-to-speech.git
cd speech-to-speech
pip install -r requirements.txt
```
## Usage

The pipeline can be run in two ways:
- **Server/client approach**: Models run on a server, and audio input/output are streamed from a client.
- **Local approach**: Uses the same client/server method, but with the loopback address.
### Server/Client Approach

To run the pipeline on the server:
```bash
python s2s_pipeline.py --recv_host "0.0.0.0" --send_host "0.0.0.0"
```

Then run the client locally to handle sending microphone input and receiving generated audio:
```bash
python listen_and_play.py --host <IP address of your server>
```
### Local Approach

Simply use the loopback address. Run the pipeline:
```bash
python s2s_pipeline.py --recv_host localhost --send_host localhost
```

Then run the client:
```bash
python listen_and_play.py --host localhost
```
## Command-line Usage

### Model Parameters
`model_name`, `torch_dtype`, and `device` are exposed for each part leveraging the Transformers implementations: Speech to Text, Language Model, and Text to Speech. Specify the targeted pipeline part with the corresponding prefix:
- `stt` (Speech to Text)
- `lm` (Language Model)
- `tts` (Text to Speech)

For example, `--lm_model_name google/gemma-2b-it`.
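As a sketch of how these compose across parts (the dtype and device values below are illustrative, not documented defaults):
```bash
# Illustrative only: adjust dtypes and devices to your hardware.
python s2s_pipeline.py \
  --stt_model_name distil-whisper/distil-large-v3 \
  --stt_torch_dtype float16 \
  --lm_model_name google/gemma-2b-it \
  --lm_device cuda
```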
### Generation Parameters
Other generation parameters of the model's generate method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. These parameters can be added to the pipeline part's arguments class if not already exposed (see `LanguageModelHandlerArguments` for example).
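For instance, combining the documented STT example with an analogous LM setting (the `--lm_gen_max_new_tokens` name below is inferred from the prefix + `_gen_` convention rather than quoted from an exposed argument list, so treat it as a sketch):
```bash
# --stt_gen_max_new_tokens is the documented example; the lm_ variant is
# inferred from the same naming convention and may need to be exposed first.
python s2s_pipeline.py \
  --stt_gen_max_new_tokens 128 \
  --lm_gen_max_new_tokens 256
```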
### Notable Parameters
#### VAD Parameters
- `--thresh`: Threshold value to trigger voice activity detection.
- `--min_speech_ms`: Minimum duration of detected voice activity to be considered speech.
- `--min_silence_ms`: Minimum length of silence intervals used for segmenting speech. It needs to be high enough not to cut sentences apart, but low enough to keep latency down (see the illustrative command below).
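An illustrative invocation tuning these thresholds (the values are placeholders, not recommended defaults):
```bash
# Placeholder values: raise --min_silence_ms if sentences get cut off,
# lower it to reduce response latency.
python s2s_pipeline.py \
  --thresh 0.3 \
  --min_speech_ms 500 \
  --min_silence_ms 1000
```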
#### Language Model
- `--init_chat_role`: Defaults to `None`. Sets the initial role in the chat template, if applicable. Refer to the model's card to set this value (e.g., for [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) you have to set `--init_chat_role system`), as in the example below.
- `--init_chat_prompt`: Needs to be set when setting `--init_chat_role`. Defaults to `"You are a helpful AI assistant."`
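For example, a command along these lines sets up Phi-3-mini-4k-instruct with its expected system role:
```bash
# Phi-3's chat template expects a system role, per its model card.
python s2s_pipeline.py \
  --lm_model_name microsoft/Phi-3-mini-4k-instruct \
  --init_chat_role system \
  --init_chat_prompt "You are a helpful AI assistant."
```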
#### Text to Speech
- `--description`: Sets the description of the Parler-TTS generated voice. Defaults to: `"A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."`
- `--play_steps_s`: Duration of the first audio chunk sent while streaming Parler-TTS output. A lower value means the first chunk is ready faster, but requires more codec decoding steps overall. Tune this value to your device and latency requirements, for instance as sketched below.
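For instance, overriding the voice description and the first-chunk duration (both values below are illustrative):
```bash
# Illustrative values: the description is free text, and 0.5 s is an example,
# not a documented default.
python s2s_pipeline.py \
  --description "A male speaker with a calm, low-pitched voice speaks slowly with very clear audio." \
  --play_steps_s 0.5
```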
## Citations

### Distil-Whisper
```
@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```