    <div align="center">
      <div>&nbsp;</div>
      <img src="logo.png" width="600"/> 
    </div>
    
    
    # Speech To Speech: an effort for an open-sourced and modular GPT-4o
    
    
    ## 📖 Quick Index
    * [Approach](#approach)
      - [Structure](#structure)
      - [Modularity](#modularity)
    * [Setup](#setup)
    * [Usage](#usage)
      - [Docker Server approach](#docker-server)
      - [Server/Client approach](#serverclient-approach)
      - [Local approach](#local-approach-mac)
    * [Command-line usage](#command-line-usage)
      - [Module level parameters](#module-level-parameters)
      - [VAD parameters](#vad-parameters)
      - [STT, LM and TTS parameters](#stt-lm-and-tts-parameters)
      - [Generation parameters](#generation-parameters)
    
    
    ## Approach
    
    
    ### Structure
    
    This repository implements a speech-to-speech cascaded pipeline consisting of the following parts:
    1. **Voice Activity Detection (VAD)**
    2. **Speech to Text (STT)**
    3. **Language Model (LM)**
    4. **Text to Speech (TTS)**
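
The cascade above can be pictured as a chain of stages connected by queues, each stage consuming the output of the previous one. The following is a minimal sketch of that idea, not the repository's implementation: the stage functions (`fake_vad`, `fake_stt`, `fake_lm`, `fake_tts`) and the helper names are hypothetical stand-ins that pass strings along instead of running models.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(fn, inbox, outbox):
    """Read items from inbox, apply fn, and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        result = fn(item)
        if result is not None:  # a stage may drop an item (e.g. VAD on silence)
            outbox.put(result)


# Hypothetical stand-ins for the four pipeline parts.
def fake_vad(chunk):
    return chunk if chunk.strip() else None  # drop "silent" chunks


def fake_stt(speech):
    return f"text({speech})"


def fake_lm(text):
    return f"reply({text})"


def fake_tts(reply):
    return f"audio({reply})"


def run_pipeline(chunks):
    """Push chunks through VAD -> STT -> LM -> TTS, one thread per stage."""
    stages = [fake_vad, fake_stt, fake_lm, fake_tts]
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(STOP)
    outputs = []
    while (item := queues[-1].get()) is not STOP:
        outputs.append(item)
    for t in threads:
        t.join()
    return outputs


if __name__ == "__main__":
    print(run_pipeline(["hello", "   ", "bye"]))
```

Because every stage runs in its own thread, a later stage can start working on the first utterance while an earlier stage is already handling the next one.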
    
    ### Modularity
    
    The pipeline provides a fully open and modular approach, with a focus on leveraging models available through the Transformers library on the Hugging Face hub. The code is designed for easy modification, and we already support device-specific and external library implementations:
    
    **VAD** 
    - [Silero VAD v5](https://github.com/snakers4/silero-vad)
    
    **STT**
    - Any [Whisper](https://huggingface.co/docs/transformers/en/model_doc/whisper) model checkpoint on the Hugging Face Hub through Transformers 🤗, including [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) and [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
    - [Lightning Whisper MLX](https://github.com/mustafaaljadery/lightning-whisper-mlx?tab=readme-ov-file#lightning-whisper-mlx)
    - [Paraformer - FunASR](https://github.com/modelscope/FunASR)
    
    **LLM**
    - Any instruction-following model on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) via Transformers 🤗
    - [mlx-lm](https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md)
    - [OpenAI API](https://platform.openai.com/docs/quickstart)
    
    **TTS**
    - [Parler-TTS](https://github.com/huggingface/parler-tts) 🤗
    - [MeloTTS](https://github.com/myshell-ai/MeloTTS)
    - [ChatTTS](https://github.com/2noise/ChatTTS?tab=readme-ov-file)
    
    
    ## Setup
    
    
    Clone the repository:
    
    ```bash
    git clone https://github.com/huggingface/speech-to-speech.git
    cd speech-to-speech
    ```
    
    
    Install the required dependencies using [uv](https://github.com/astral-sh/uv):
    
    ```bash
    uv pip install -r requirements.txt
    ```
    
    
    For Mac users, use the `requirements_mac.txt` file instead:
    ```bash
    uv pip install -r requirements_mac.txt
    ```
    
    If you want to use MeloTTS, you also need to run:
    ```bash
    python -m unidic download
    ```
    
    
    
    ## Usage
    
    
    The pipeline can be run in two ways:
    - **Server/Client approach**: Models run on a server, and audio input/output are streamed from a client.
    - **Local approach**: The whole pipeline runs locally, on the same machine that captures and plays the audio.
    
    
    
    ### Recommended setup 
    
    ### Server/Client Approach
    
    
    
    1. Run the pipeline on the server:
       ```bash
       python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
       ```
    
    
    
    2. Run the client locally to handle microphone input and receive generated audio:
       ```bash
       python listen_and_play.py --host <IP address of your server>
       ```
    
    
    
    ### Local Approach (Mac)
    
    1. For optimal settings on Mac:
       ```bash
       python s2s_pipeline.py --local_mac_optimal_settings
       ```
    
    
    
    This setting:
    - Adds `--device mps` to use MPS for all models.
    - Sets LightningWhisperMLX for STT.
    - Sets MLX LM for the language model.
    - Sets MeloTTS for TTS.
    
    ### Docker Server
    
    #### Install the NVIDIA Container Toolkit
    
    Follow the [NVIDIA Container Toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
    
    #### Start the docker container
    ```docker compose up```
    
    
    
    
    ### Recommended usage with CUDA
    
    Leverage Torch Compile for Whisper and Parler-TTS. **The usage of Parler-TTS allows for audio output streaming, further reducing the overall latency** 🚀:
    
    
    ```bash
    python s2s_pipeline.py \
        --lm_model_name microsoft/Phi-3-mini-4k-instruct \
        --stt_compile_mode reduce-overhead \
        --tts_compile_mode default \
        --recv_host 0.0.0.0 \
        --send_host 0.0.0.0
    ```
    
    For the moment, compile modes that capture CUDA Graphs (`reduce-overhead`, `max-autotune`) are not compatible with streaming Parler-TTS.
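
As a small illustration of this restriction, a launcher script could reject an incompatible mode before starting the pipeline. The helper below is hypothetical (`check_tts_compile_mode` and `CUDA_GRAPH_MODES` are names introduced here, not part of the repository) and only encodes the two modes named in the note.

```python
# Compile modes named in the note as capturing CUDA Graphs.
CUDA_GRAPH_MODES = {"reduce-overhead", "max-autotune"}


def check_tts_compile_mode(mode, streaming=True):
    """Return `mode` unchanged, or raise if it cannot be used with streaming TTS."""
    if streaming and mode in CUDA_GRAPH_MODES:
        raise ValueError(
            f"--tts_compile_mode {mode} captures CUDA Graphs and is not "
            "compatible with streaming Parler-TTS; use 'default' instead."
        )
    return mode
```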
    
    
    ### Multi-language Support
    
    
    The pipeline currently supports English, French, Spanish, Chinese, Japanese, and Korean.  
    Two use cases are considered:
    
    - **Single-language conversation**: Enforce the language setting using the `--language` flag, specifying the target language code (default is 'en').
    - **Language switching**: Set `--language` to 'auto'. In this case, Whisper detects the language for each spoken prompt, and the LLM is prompted with "`Please reply to my message in ...`" to ensure the response is in the detected language.
    
    Please note that you must use STT and LLM checkpoints compatible with the target language(s). For the TTS part, Parler-TTS is not yet multilingual (though that feature is coming soon! 🤗). In the meantime, you should use MeloTTS (which supports English, French, Spanish, Chinese, Japanese, and Korean) or ChatTTS.
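
The language-switching behaviour can be sketched as follows. This is an illustrative reconstruction rather than the repository's code: `build_prompt` and `LANGUAGE_NAMES` are hypothetical names, and the code-to-name mapping is an assumption covering only the six languages listed above.

```python
# Assumed mapping from language codes to names (hypothetical helper table).
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "es": "Spanish",
    "zh": "Chinese",
    "ja": "Japanese",
    "ko": "Korean",
}


def build_prompt(user_text, language_setting, detected_code=None):
    """Return the text sent to the LLM for one spoken prompt.

    language_setting is the value of --language; detected_code is the
    language code reported by the STT model (used only in 'auto' mode).
    """
    if language_setting != "auto":
        return user_text  # single-language conversation: no instruction added
    name = LANGUAGE_NAMES.get(detected_code)
    if name is None:
        return user_text  # unknown language: fall back to the raw text
    return f"Please reply to my message in {name}. {user_text}"
```

In this sketch the instruction is prepended to every prompt in `auto` mode, so the reply language can change from one turn to the next.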
    
    #### With the server version:
    
    
    For automatic language detection:
    
    ```bash
    python s2s_pipeline.py \
        --stt_model_name large-v3 \
        --language auto \
        --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
    ```
    
    Or for one language in particular, Chinese in this example:
    
    ```bash
    python s2s_pipeline.py \
        --stt_model_name large-v3 \
        --language zh \
        --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct
    ```
    
    #### Local Mac Setup
    
    For automatic language detection:
    
    ```bash
    python s2s_pipeline.py \
        --local_mac_optimal_settings \
        --device mps \
        --stt_model_name large-v3 \
        --language auto \
        --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
    ```
    
    Or for one language in particular, Chinese in this example:
    
    ```bash
    python s2s_pipeline.py \
        --local_mac_optimal_settings \
        --device mps \
        --stt_model_name large-v3 \
        --language zh \
        --mlx_lm_model_name mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
    ```
    
    
    ## Command-line Usage
    
    > **_NOTE:_** References for all the CLI arguments can be found directly in the [arguments classes](https://github.com/huggingface/speech-to-speech/tree/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes) or by running `python s2s_pipeline.py -h`.
    
    ### Module level Parameters 
    See the [ModuleArguments](https://github.com/huggingface/speech-to-speech/blob/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes/module_arguments.py) class. It allows you to set:
    - a common `--device` (if you want every part to run on the same device)
    - `--mode`: `local` or `server`
    - the chosen STT implementation
    - the chosen LM implementation
    - the chosen TTS implementation
    - the logging level
    
    ### VAD parameters
    See [VADHandlerArguments](https://github.com/huggingface/speech-to-speech/blob/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes/vad_arguments.py) class. Notably:
    
    - `--thresh`: Threshold value to trigger voice activity detection.
    - `--min_speech_ms`: Minimum duration of detected voice activity to be considered speech.
    - `--min_silence_ms`: Minimum length of silence intervals for segmenting speech, balancing sentence cutting and latency reduction.
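
To show how these three values interact, here is a simplified segmentation sketch over per-frame speech probabilities. It is not Silero VAD's algorithm nor the repository's handler: `segment_speech`, its arguments, and the fixed frame length are assumptions made for illustration.

```python
def segment_speech(probs, frame_ms, thresh, min_speech_ms, min_silence_ms):
    """Return (start_ms, end_ms) speech segments from per-frame probabilities."""
    segments = []
    start = None      # start time of the current speech run, if any
    silence_ms = 0    # length of the silence currently trailing that run
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= thresh:
            if start is None:
                start = t
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                end = t + frame_ms - silence_ms  # time where the silence began
                if end - start >= min_speech_ms:
                    segments.append((start, end))
                start, silence_ms = None, 0
    if start is not None:
        # Audio ended while a speech run was still open.
        end = len(probs) * frame_ms - silence_ms
        if end - start >= min_speech_ms:
            segments.append((start, end))
    return segments
```

With 100 ms frames, `thresh=0.5`, `min_speech_ms=200` and the probabilities `[0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1]`, a `min_silence_ms` of 300 yields one segment `(100, 700)`, while 200 cuts earlier and yields `(100, 400)`, dropping the short burst that follows. A smaller silence window lowers latency at the cost of cutting sentences more often.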
    
    ### STT, LM and TTS parameters
    
    
    `model_name`, `torch_dtype`, and `device` are exposed for each implementation of the Speech to Text, Language Model, and Text to Speech parts. Specify the targeted pipeline part with the corresponding prefix (`stt`, `lm`, or `tts`). Check the implementations' [arguments classes](https://github.com/huggingface/speech-to-speech/tree/d5e460721e578fef286c7b64e68ad6a57a25cf1b/arguments_classes) for more details.
    
    For example:
    ```bash
    --lm_model_name google/gemma-2b-it
    ```
    
    ### Generation parameters
    
    Other generation parameters of the model's generate method can be set using the part's prefix + `_gen_`, e.g., `--stt_gen_max_new_tokens 128`. These parameters can be added to the pipeline part's arguments class if not already exposed.
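
The prefix convention can be illustrated with a short sketch that separates a part's regular arguments from its generation arguments. This is an assumption about how such a split could be done, not the repository's parsing code; `split_gen_kwargs` is a hypothetical helper.

```python
def split_gen_kwargs(args, prefix):
    """Split parsed CLI args for one pipeline part into (regular, generation) dicts.

    `args` maps argument names (without leading dashes) to values, and
    `prefix` is the pipeline part, e.g. "stt", "lm" or "tts".
    """
    part_prefix = f"{prefix}_"
    gen_prefix = f"{prefix}_gen_"
    regular, gen = {}, {}
    for name, value in args.items():
        if name.startswith(gen_prefix):
            # e.g. stt_gen_max_new_tokens -> max_new_tokens
            gen[name[len(gen_prefix):]] = value
        elif name.startswith(part_prefix):
            # e.g. stt_model_name -> model_name
            regular[name[len(part_prefix):]] = value
    return regular, gen
```

The second dictionary could then be forwarded as keyword arguments to the model's generate method, while arguments belonging to other parts are ignored.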
    
    
    ## Citations
    
    
    ### Silero VAD
    
    ```bibtex
    @misc{Silero VAD,
      author = {Silero Team},
      title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
      year = {2021},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/snakers4/silero-vad}},
      commit = {insert_some_commit_here},
      email = {hello@silero.ai}
    }
    ```
    
    
    ### Distil-Whisper
    
    ```bibtex
    @misc{gandhi2023distilwhisper,
          title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
          author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
          year={2023},
          eprint={2311.00430},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    ```
    
    
    ### Parler-TTS
    
    ```bibtex
    @misc{lacombe-etal-2024-parler-tts,
      author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
      title = {Parler-TTS},
      year = {2024},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/huggingface/parler-tts}}
    }
    ```