diff --git a/README.md b/README.md index 38aaf58468b3c86576034c7406b8b5edb03b2ecc..d38632fc0ae2fd1a170ee944c553c3002e3c9fd6 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,44 @@ # Llama Recipes: Examples to get started using the Llama models from Meta <!-- markdown-link-check-disable --> + +> Note: We recently refactored the repo; [archive-main](https://github.com/meta-llama/llama-recipes/tree/archive-main) is a snapshot branch from before the refactor. + +Welcome to the official repository for helping you get started with [inference](https://github.com/meta-llama/llama-recipes/tree/main/getting-started/inference), [fine-tuning](https://github.com/meta-llama/llama-recipes/tree/main/getting-started/finetuning) and [end-to-end use-cases](https://github.com/meta-llama/llama-recipes/tree/main/end-to-end-use-cases) for building with the Llama model family. + +The examples cover the most popular community approaches and use-cases, and support the latest Llama 3.2 Vision and Llama 3.2 Text models. + +> [!TIP] +> Repository Structure: +> * [Start building with the Llama 3.2 models](./getting-started/) +> * [End to End Use Cases with the Llama model family](https://github.com/meta-llama/llama-recipes/tree/main/end-to-end-use-cases) +> * [Examples of building with 3rd Party Llama Providers](https://github.com/meta-llama/llama-recipes/tree/main/3p-integrations) +> * [Model Benchmarks](https://github.com/meta-llama/llama-recipes/tree/main/benchmarks) + +> [!TIP] +> Get started with Llama 3.2 using these new recipes: +> * [Finetune Llama 3.2 Vision](https://github.com/meta-llama/llama-recipes/blob/main/recipes/getting-started/finetuning/finetune_vision_model.md) +> * [Multimodal Inference with Llama 3.2 Vision](https://github.com/meta-llama/llama-recipes/blob/main/recipes/getting-started/inference/local_inference/README.md#multimodal-inference) +> * [Inference on Llama Guard 1B + Multimodal inference on Llama Guard 11B-Vision](https://github.com/meta-llama/llama-recipes/blob/main/recipes/responsible_ai/llama_guard/llama_guard_text_and_vision_inference.ipynb) + +<!-- markdown-link-check-enable --> +> [!NOTE] +> Llama 3.2 follows the same prompt template as Llama 3.1, with a new special token `<|image|>` representing the input image for the multimodal models. +> +> More details on the prompt templates for image reasoning, tool-calling and code interpreter can be found [on the documentation website](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_2). + + +## Repository Structure: + +- [3P Integrations](https://github.com/meta-llama/llama-recipes/tree/main/3p-integrations): Getting started recipes and end-to-end use-cases from various Llama providers +- [End to End Use Cases](https://github.com/meta-llama/llama-recipes/tree/main/end-to-end-use-cases): As the name suggests, examples spanning various domains and applications +- [Getting Started](https://github.com/meta-llama/llama-recipes/tree/main/getting-started/): Reference examples for inference, fine-tuning and RAG +- [Benchmarks](https://github.com/meta-llama/llama-recipes/tree/main/benchmarks): Scripts for benchmarking Llama models + + +## FAQ: + + + The 'llama-recipes' repository is a companion to the [Meta Llama](https://github.com/meta-llama/llama-models) models. We support the latest version, [Llama 3.2 Vision](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md) and [Llama 3.2 Text](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md), in this repository.
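The note above introduces the `<|image|>` special token used by the Llama 3.2 multimodal models. For reference, the sketch below shows minimal multimodal inference with Hugging Face `transformers`; it assumes a `transformers` release that ships Llama 3.2 Vision (Mllama) support and uses `meta-llama/Llama-3.2-11B-Vision-Instruct` as an example checkpoint, so swap in the variant you have access to. The processor renders the `<|image|>` token for you, so you never type it by hand.

```python
# Minimal sketch of multimodal inference with Llama 3.2 Vision. Assumes a
# transformers version with Mllama support and an HF account with access to
# the meta-llama checkpoints.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # example checkpoint

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("local_image.jpg")  # placeholder path; use any local image

# An image content block in the chat is rendered as the `<|image|>` token by
# apply_chat_template, followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# add_special_tokens=False because the chat template already adds <|begin_of_text|>.
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```

For the full walkthrough, see the multimodal inference recipe linked in the tips above.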
This repository contains example scripts and notebooks to get started with the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Llama and other tools in the LLM ecosystem. The examples here use Llama locally, in the cloud, and on-prem. > [!TIP] diff --git a/UPDATES.md b/UPDATES.md index 9ff4b961dd61e7dc4abac1ca080efc283135d422..f4dc5cef27ae8367c891be483a927ee6fa032d8c 100644 --- a/UPDATES.md +++ b/UPDATES.md @@ -13,10 +13,11 @@ Nested Folders rename: - /dev_requirements.txt -> /src/dev_requirements.txt - /requirements.txt -> /src/requirements.txt - /tools -> /end-to-end-use-cases/benchmarks/ +- /recipes/experimental/long_context -> /end-to-end-use-cases/long_context Removed folders: - /flagged (Empty folder) - /recipes/quickstart/Running_Llama3_Anywhere (Redundant code) -- /recipes/quickstart/codellama (deprecated model) +- /recipes/quickstart/inference/codellama (deprecated model) diff --git a/recipes/experimental/long_context/H2O/README.md b/end-to-end-use-cases/long_context/H2O/README.md similarity index 100% rename from recipes/experimental/long_context/H2O/README.md rename to end-to-end-use-cases/long_context/H2O/README.md diff --git a/recipes/experimental/long_context/H2O/data/summarization/cnn_dailymail.jsonl b/end-to-end-use-cases/long_context/H2O/data/summarization/cnn_dailymail.jsonl similarity index 100% rename from recipes/experimental/long_context/H2O/data/summarization/cnn_dailymail.jsonl rename to end-to-end-use-cases/long_context/H2O/data/summarization/cnn_dailymail.jsonl diff --git a/recipes/experimental/long_context/H2O/data/summarization/xsum.jsonl b/end-to-end-use-cases/long_context/H2O/data/summarization/xsum.jsonl similarity index 100% rename from recipes/experimental/long_context/H2O/data/summarization/xsum.jsonl rename to end-to-end-use-cases/long_context/H2O/data/summarization/xsum.jsonl diff --git a/recipes/experimental/long_context/H2O/requirements.txt b/end-to-end-use-cases/long_context/H2O/requirements.txt similarity index 100% rename from recipes/experimental/long_context/H2O/requirements.txt rename to end-to-end-use-cases/long_context/H2O/requirements.txt diff --git a/recipes/experimental/long_context/H2O/run_streaming.py b/end-to-end-use-cases/long_context/H2O/run_streaming.py similarity index 100% rename from recipes/experimental/long_context/H2O/run_streaming.py rename to end-to-end-use-cases/long_context/H2O/run_streaming.py diff --git a/recipes/experimental/long_context/H2O/run_summarization.py b/end-to-end-use-cases/long_context/H2O/run_summarization.py similarity index 100% rename from recipes/experimental/long_context/H2O/run_summarization.py rename to end-to-end-use-cases/long_context/H2O/run_summarization.py diff --git a/recipes/experimental/long_context/H2O/src/streaming.sh b/end-to-end-use-cases/long_context/H2O/src/streaming.sh similarity index 100% rename from recipes/experimental/long_context/H2O/src/streaming.sh rename to end-to-end-use-cases/long_context/H2O/src/streaming.sh diff --git a/recipes/experimental/long_context/H2O/utils/cache.py b/end-to-end-use-cases/long_context/H2O/utils/cache.py similarity index 100% rename from recipes/experimental/long_context/H2O/utils/cache.py rename to end-to-end-use-cases/long_context/H2O/utils/cache.py diff --git a/recipes/experimental/long_context/H2O/utils/llama.py b/end-to-end-use-cases/long_context/H2O/utils/llama.py similarity index 100% rename from recipes/experimental/long_context/H2O/utils/llama.py rename to 
end-to-end-use-cases/long_context/H2O/utils/llama.py diff --git a/recipes/experimental/long_context/H2O/utils/streaming.py b/end-to-end-use-cases/long_context/H2O/utils/streaming.py similarity index 100% rename from recipes/experimental/long_context/H2O/utils/streaming.py rename to end-to-end-use-cases/long_context/H2O/utils/streaming.py diff --git a/getting-started/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb b/getting-started/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb deleted file mode 100644 index 06f0e4094afaac2114c9a9ebdffd3b24026cb801..0000000000000000000000000000000000000000 --- a/getting-started/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb +++ /dev/null @@ -1,336 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Running Meta Llama 3.1 on Google Colab using Hugging Face transformers library\n", - "This notebook goes over how you can set up and run Llama 3.1 using Hugging Face transformers library\n", - "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Steps at a glance:\n", - "This demo showcases how to run the example with already converted Llama 3.1 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n", - "\n", - "To use already converted weights, start here:\n", - "1. Request download of model weights from the Llama website\n", - "2. Login to Hugging Face from your terminal using the same email address as (1). Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start). \n", - "3. Run the example\n", - "\n", - "\n", - "Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n", - "1. Request download of model weights from the Llama website\n", - "2. Clone the llama repo and get the weights\n", - "3. Convert the model weights\n", - "4. Prepare the script\n", - "5. Run the example" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Using already converted weights" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 1. Request download of model weights from the Llama website\n", - "Request download of model weights from the Llama website\n", - "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download modelsâ€. \n", - "\n", - "Fill the required information, select the models “Meta Llama 3.1†and accept the terms & conditions. You will receive a URL in your email in a short time." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 2. 
Prepare the script\n", - "\n", - "We will install the Transformers library and Accelerate library for our demo.\n", - "\n", - "The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n", - "The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install transformers\n", - "!pip install accelerate" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from transformers import AutoTokenizer\n", - "import transformers\n", - "import torch" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3.1-8B-Instruct`. Using Meta models from Hugging Face requires you to\n", - "\n", - "1. Accept Terms of Service for Meta Llama 3.1 on Meta [website](https://llama.meta.com/llama-downloads).\n", - "2. Use the same email address from Step (1) to login into Hugging Face.\n", - "\n", - "Follow the instructions on this Hugging Face page to login from your [terminal](https://huggingface.co/docs/huggingface_hub/en/quick-start). " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pip install --upgrade huggingface_hub" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from huggingface_hub import login\n", - "login()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n", - "tokenizer = AutoTokenizer.from_pretrained(model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pipeline = transformers.pipeline(\n", - "\"text-generation\",\n", - " model=model,\n", - " torch_dtype=torch.float16,\n", - " device_map=\"auto\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 3. Run the example\n", - "\n", - "Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n", - "\n", - "Let’s also generate a text sequence based on the input that we provide. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sequences = pipeline(\n", - " 'I have tomatoes, basil and cheese at home. 
What can I cook for dinner?\\n',\n", - " do_sample=True,\n", - " top_k=10,\n", - " num_return_sequences=1,\n", - " eos_token_id=tokenizer.eos_token_id,\n", - " truncation = True,\n", - " max_length=400,\n", - ")\n", - "\n", - "for seq in sequences:\n", - " print(f\"Result: {seq['generated_text']}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "<br>\n", - "\n", - "### Downloading and converting weights to Hugging Face format" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 1. Request download of model weights from the Llama website\n", - "Request download of model weights from the Llama website\n", - "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download modelsâ€. \n", - "\n", - "Fill the required information, select the models \"Meta Llama 3\" and accept the terms & conditions. You will receive a URL in your email in a short time." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 2. Clone the llama repo and get the weights\n", - "Git clone the [Meta Llama 3 repo](https://github.com/meta-llama/llama3). Run the `download.sh` script and follow the instructions. This will download the model checkpoints and tokenizer.\n", - "\n", - "This example demonstrates a Meta Llama 3.1 model with 8B-instruct parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 3. Convert the model weights using Hugging Face transformer from source\n", - "\n", - "* `python3 -m venv hf-convertor`\n", - "* `source hf-convertor/bin/activate`\n", - "* `git clone https://github.com/huggingface/transformers.git`\n", - "* `cd transformers`\n", - "* `pip install -e .`\n", - "* `pip install torch tiktoken blobfile accelerate`\n", - "* `python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${path_to_meta_downloaded_model} --output_dir ${path_to_save_converted_hf_model} --model_size 8B --llama_version 3.1`" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "#### 4. Prepare the script\n", - "Import the following necessary modules in your script: \n", - "* `AutoModel` is the Llama 3 model class\n", - "* `AutoTokenizer` prepares your prompt for the model to process\n", - "* `pipeline` is an abstraction to generate model outputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "import transformers\n", - "from transformers import AutoModelForCausalLM, AutoTokenizer\n", - "\n", - "model_dir = \"${path_the_converted_hf_model}\"\n", - "model = AutoModelForCausalLM.from_pretrained(\n", - " model_dir,\n", - " device_map=\"auto\",\n", - " )\n", - "tokenizer = AutoTokenizer.from_pretrained(model_dir)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`) among various other options. 
\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "pipeline = transformers.pipeline(\n", - " \"text-generation\",\n", - " model=model,\n", - " tokenizer=tokenizer,\n", - " torch_dtype=torch.float16,\n", - " device_map=\"auto\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n", - "\n", - "By changing `max_length`, you can specify how long you’d like the generated response to be. \n", - "Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n", - "\n", - "In your script, add the following to provide input, and information on how to run the pipeline:\n", - "\n", - "\n", - "#### 5. Run the example" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sequences = pipeline(\n", - " 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n", - " do_sample=True,\n", - " top_k=10,\n", - " num_return_sequences=1,\n", - " eos_token_id=tokenizer.eos_token_id,\n", - " max_length=400,\n", - ")\n", - "for seq in sequences:\n", - " print(f\"{seq['generated_text']}\")\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/getting-started/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb b/getting-started/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb deleted file mode 100644 index 0a5f43059bc4af06884a319c21134bf3ce014d3b..0000000000000000000000000000000000000000 --- a/getting-started/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb +++ /dev/null @@ -1,166 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Running Llama 3 on Mac, Windows or Linux\n", - "This notebook goes over how you can set up and run Llama 3.1 locally on a Mac, Windows or Linux using [Ollama](https://ollama.com/)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Steps at a glance:\n", - "1. Download and install Ollama.\n", - "2. Download and test run Llama 3.1\n", - "3. Use local Llama 3.1 via Python.\n", - "4. Use local Llama 3.1 via LangChain.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 1. Download and install Ollama\n", - "\n", - "On Mac or Windows, go to the Ollama download page [here](https://ollama.com/download) and select your platform to download it, then double click the downloaded file to install Ollama.\n", - "\n", - "On Linux, you can simply run on a terminal `curl -fsSL https://ollama.com/install.sh | sh` to download and install Ollama." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 2. 
Download and test run Llama 3\n", - "\n", - "On a terminal or console, run `ollama pull llama3.1` to download the Llama 3.1 8b chat model, in the 4-bit quantized format with size about 4.7 GB.\n", - "\n", - "Run `ollama pull llama3.1:70b` to download the Llama 3.1 70b chat model, also in the 4-bit quantized format with size 39GB.\n", - "\n", - "Then you can run `ollama run llama3.1` and ask Llama 3.1 questions such as \"who wrote the book godfather?\" or \"who wrote the book godfather? answer in one sentence.\" You can also try `ollama run llama3.1:70b`, but the inference speed will most likely be too slow - for example, on an Apple M1 Pro with 32GB RAM, it takes over 10 seconds to generate one token using Llama 3.1 70b chat (vs over 10 tokens per second with Llama 3.1 8b chat).\n", - "\n", - "You can also run the following command to test Llama 3.1 8b chat:\n", - "```\n", - " curl http://localhost:11434/api/chat -d '{\n", - " \"model\": \"llama3.1\",\n", - " \"messages\": [\n", - " {\n", - " \"role\": \"user\",\n", - " \"content\": \"who wrote the book godfather?\"\n", - " }\n", - " ],\n", - " \"stream\": false\n", - "}'\n", - "```\n", - "\n", - "The complete Ollama API doc is [here](https://github.com/ollama/ollama/blob/main/docs/api.md)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 3. Use local Llama 3.1 via Python\n", - "\n", - "The Python code below is the port of the curl command above." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "import json\n", - "\n", - "url = \"http://localhost:11434/api/chat\"\n", - "\n", - "def llama3(prompt):\n", - " data = {\n", - " \"model\": \"llama3.1\",\n", - " \"messages\": [\n", - " {\n", - " \"role\": \"user\",\n", - " \"content\": prompt\n", - " }\n", - " ],\n", - " \"stream\": False\n", - " }\n", - " \n", - " headers = {\n", - " 'Content-Type': 'application/json'\n", - " }\n", - " \n", - " response = requests.post(url, headers=headers, json=data)\n", - " \n", - " return(response.json()['message']['content'])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "response = llama3(\"who wrote the book godfather\")\n", - "print(response)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 4. Use local Llama 3.1 via LangChain\n", - "\n", - "Code below use LangChain with Ollama to query Llama 3 running locally. For a more advanced example of using local Llama 3 with LangChain and agent-powered RAG, see [this](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_rag_agent_llama3_local.ipynb)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install langchain" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_community.chat_models import ChatOllama\n", - "\n", - "llm = ChatOllama(model=\"llama3.1\", temperature=0)\n", - "response = llm.invoke(\"who wrote the book godfather?\")\n", - "print(response.content)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/getting-started/inference/code_llama/README.md b/getting-started/inference/code_llama/README.md deleted file mode 100644 index ef1be5e83731df0527483695f7c230e7f9acdd82..0000000000000000000000000000000000000000 --- a/getting-started/inference/code_llama/README.md +++ /dev/null @@ -1,39 +0,0 @@ -# Code Llama - -Code llama was recently released with three flavors, base-model that support multiple programming languages, Python fine-tuned model and an instruction fine-tuned and aligned variation of Code Llama, please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and 34B models are not trained on infilling objective, hence can not be used for infilling use-case. - -Find the scripts to run Code Llama, where there are two examples of running code completion and infilling. - -**Note** Please find the right model on HF [here](https://huggingface.co/models?search=meta-llama%20codellama). - -Make sure to install Transformers from source for now - -```bash - -pip install git+https://github.com/huggingface/transformers - -``` - -To run the code completion example: - -```bash - -python code_completion_example.py --model_name MODEL_NAME --prompt_file code_completion_prompt.txt --temperature 0.2 --top_p 0.9 - -``` - -To run the code infilling example: - -```bash - -python code_infilling_example.py --model_name MODEL_NAME --prompt_file code_infilling_prompt.txt --temperature 0.2 --top_p 0.9 - -``` -To run the 70B Instruct model example run the following (you'll need to enter the system and user prompts to instruct the model): - -```bash - -python code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9 - -``` -You can learn more about the chat prompt template [on HF](https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/meta-llama/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example. diff --git a/getting-started/inference/code_llama/code_completion_example.py b/getting-started/inference/code_llama/code_completion_example.py deleted file mode 100644 index 201f8df8b084c4617b0ee60f249f4b4c9c12fbd5..0000000000000000000000000000000000000000 --- a/getting-started/inference/code_llama/code_completion_example.py +++ /dev/null @@ -1,119 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. 
- -# from accelerate import init_empty_weights, load_checkpoint_and_dispatch - -import fire -import os -import sys -import time - -import torch -from transformers import AutoTokenizer - -from llama_recipes.inference.safety_utils import get_safety_checker -from llama_recipes.inference.model_utils import load_model, load_peft_model - - -def main( - model_name, - peft_model: str=None, - quantization: bool=False, - max_new_tokens =100, #The maximum numbers of tokens to generate - prompt_file: str=None, - seed: int=42, #seed value for reproducibility - do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise. - min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens - use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding. - top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. - temperature: float=0.6, # [optional] The value used to modulate the next token probabilities. - top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering. - repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty. - length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation. - enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api - enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs - enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5 - enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard - use_fast_kernels: bool = True, # Enable using SDPA from PyTroch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels - **kwargs -): - if prompt_file is not None: - assert os.path.exists( - prompt_file - ), f"Provided Prompt file does not exist {prompt_file}" - with open(prompt_file, "r") as f: - user_prompt = f.read() - else: - print("No user prompt provided. 
Exiting.") - sys.exit(1) - - # Set the seeds for reproducibility - torch.cuda.manual_seed(seed) - torch.manual_seed(seed) - - model = load_model(model_name, quantization, use_fast_kernels) - if peft_model: - model = load_peft_model(model, peft_model) - - model.eval() - - tokenizer = AutoTokenizer.from_pretrained(model_name) - safety_checker = get_safety_checker(enable_azure_content_safety, - enable_sensitive_topics, - enable_salesforce_content_safety, - enable_llamaguard_content_safety, - ) - - # Safety check of the user prompt - safety_results = [check(user_prompt) for check in safety_checker] - are_safe = all([r[1] for r in safety_results]) - if are_safe: - print("User prompt deemed safe.") - print(f"User prompt:\n{user_prompt}") - else: - print("User prompt deemed unsafe.") - for method, is_safe, report in safety_results: - if not is_safe: - print(method) - print(report) - print("Skipping the inference as the prompt is not safe.") - sys.exit(1) # Exit the program with an error status - - batch = tokenizer(user_prompt, return_tensors="pt") - - batch = {k: v.to("cuda") for k, v in batch.items()} - start = time.perf_counter() - with torch.no_grad(): - outputs = model.generate( - **batch, - max_new_tokens=max_new_tokens, - do_sample=do_sample, - top_p=top_p, - temperature=temperature, - min_length=min_length, - use_cache=use_cache, - top_k=top_k, - repetition_penalty=repetition_penalty, - length_penalty=length_penalty, - **kwargs - ) - e2e_inference_time = (time.perf_counter()-start)*1000 - print(f"the inference time is {e2e_inference_time} ms") - output_text = tokenizer.decode(outputs[0], skip_special_tokens=True) - - # Safety check of the model output - safety_results = [check(output_text) for check in safety_checker] - are_safe = all([r[1] for r in safety_results]) - if are_safe: - print("User input and model output deemed safe.") - print(f"Model output:\n{output_text}") - else: - print("Model output deemed unsafe.") - for method, is_safe, report in safety_results: - if not is_safe: - print(method) - print(report) - - -if __name__ == "__main__": - fire.Fire(main) diff --git a/getting-started/inference/code_llama/code_completion_prompt.txt b/getting-started/inference/code_llama/code_completion_prompt.txt deleted file mode 100644 index 8e184e2fe3fd374d17de604b16bf9d3b1489a10e..0000000000000000000000000000000000000000 --- a/getting-started/inference/code_llama/code_completion_prompt.txt +++ /dev/null @@ -1,7 +0,0 @@ -import argparse - -def main(string: str): - print(string) - print(string[::-1]) - -if __name__ == "__main__": \ No newline at end of file diff --git a/getting-started/inference/code_llama/code_infilling_example.py b/getting-started/inference/code_llama/code_infilling_example.py deleted file mode 100644 index a955eb5ce00e94f681d9d3a2b0ed2c58bf624661..0000000000000000000000000000000000000000 --- a/getting-started/inference/code_llama/code_infilling_example.py +++ /dev/null @@ -1,118 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. 
- -# from accelerate import init_empty_weights, load_checkpoint_and_dispatch - -import fire -import torch -import os -import sys -import time - -from transformers import AutoTokenizer - -from llama_recipes.inference.safety_utils import get_safety_checker -from llama_recipes.inference.model_utils import load_model, load_peft_model - -def main( - model_name, - peft_model: str=None, - quantization: bool=False, - max_new_tokens =100, #The maximum numbers of tokens to generate - prompt_file: str=None, - seed: int=42, #seed value for reproducibility - do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise. - min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens - use_cache: bool=True, #[optional] Whether or not the model should use the past last key/values attentions Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding. - top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. - temperature: float=0.6, # [optional] The value used to modulate the next token probabilities. - top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering. - repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty. - length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation. - enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api - enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs - enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5 - enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard - use_fast_kernels: bool = True, # Enable using SDPA from PyTroch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels - **kwargs -): - if prompt_file is not None: - assert os.path.exists( - prompt_file - ), f"Provided Prompt file does not exist {prompt_file}" - with open(prompt_file, "r") as f: - user_prompt = f.read() - else: - print("No user prompt provided. 
Exiting.") - sys.exit(1) - # Set the seeds for reproducibility - torch.cuda.manual_seed(seed) - torch.manual_seed(seed) - - model = load_model(model_name, quantization, use_fast_kernels) - model.config.tp_size=1 - if peft_model: - model = load_peft_model(model, peft_model) - - model.eval() - - tokenizer = AutoTokenizer.from_pretrained(model_name) - - safety_checker = get_safety_checker(enable_azure_content_safety, - enable_sensitive_topics, - enable_salesforce_content_safety, - enable_llamaguard_content_safety, - ) - - # Safety check of the user prompt - safety_results = [check(user_prompt) for check in safety_checker] - are_safe = all([r[1] for r in safety_results]) - if are_safe: - print("User prompt deemed safe.") - print(f"User prompt:\n{user_prompt}") - else: - print("User prompt deemed unsafe.") - for method, is_safe, report in safety_results: - if not is_safe: - print(method) - print(report) - print("Skipping the inference as the prompt is not safe.") - sys.exit(1) # Exit the program with an error status - - batch = tokenizer(user_prompt, return_tensors="pt") - batch = {k: v.to("cuda") for k, v in batch.items()} - - start = time.perf_counter() - with torch.no_grad(): - outputs = model.generate( - **batch, - max_new_tokens=max_new_tokens, - do_sample=do_sample, - top_p=top_p, - temperature=temperature, - min_length=min_length, - use_cache=use_cache, - top_k=top_k, - repetition_penalty=repetition_penalty, - length_penalty=length_penalty, - **kwargs - ) - e2e_inference_time = (time.perf_counter()-start)*1000 - print(f"the inference time is {e2e_inference_time} ms") - filling = tokenizer.batch_decode(outputs[:, batch["input_ids"].shape[1]:], skip_special_tokens=True)[0] - # Safety check of the model output - safety_results = [check(filling) for check in safety_checker] - are_safe = all([r[1] for r in safety_results]) - if are_safe: - print("User input and model output deemed safe.") - print(user_prompt.replace("<FILL_ME>", filling)) - else: - print("Model output deemed unsafe.") - for method, is_safe, report in safety_results: - if not is_safe: - print(method) - print(report) - - -if __name__ == "__main__": - fire.Fire(main) diff --git a/getting-started/inference/code_llama/code_infilling_prompt.txt b/getting-started/inference/code_llama/code_infilling_prompt.txt deleted file mode 100644 index 3fe94b7a5db69dc5f6dda9c6efb3d47575b72c87..0000000000000000000000000000000000000000 --- a/getting-started/inference/code_llama/code_infilling_prompt.txt +++ /dev/null @@ -1,3 +0,0 @@ -def remove_non_ascii(s: str) -> str: - """ <FILL_ME> - return result diff --git a/getting-started/inference/code_llama/code_instruct_example.py b/getting-started/inference/code_llama/code_instruct_example.py deleted file mode 100644 index d7b98f088be718f3bdb08d8b1c30beb990a143db..0000000000000000000000000000000000000000 --- a/getting-started/inference/code_llama/code_instruct_example.py +++ /dev/null @@ -1,143 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. 
- -import fire -import os -import sys -import time - -import torch -from transformers import AutoTokenizer - -from llama_recipes.inference.safety_utils import get_safety_checker -from llama_recipes.inference.model_utils import load_model, load_peft_model - - -def handle_safety_check(are_safe_user_prompt, user_prompt, safety_results_user_prompt, are_safe_system_prompt, system_prompt, safety_results_system_prompt): - """ - Handles the output based on the safety check of both user and system prompts. - - Parameters: - - are_safe_user_prompt (bool): Indicates whether the user prompt is safe. - - user_prompt (str): The user prompt that was checked for safety. - - safety_results_user_prompt (list of tuples): A list of tuples for the user prompt containing the method, safety status, and safety report. - - are_safe_system_prompt (bool): Indicates whether the system prompt is safe. - - system_prompt (str): The system prompt that was checked for safety. - - safety_results_system_prompt (list of tuples): A list of tuples for the system prompt containing the method, safety status, and safety report. - """ - def print_safety_results(are_safe_prompt, prompt, safety_results, prompt_type="User"): - """ - Prints the safety results for a prompt. - - Parameters: - - are_safe_prompt (bool): Indicates whether the prompt is safe. - - prompt (str): The prompt that was checked for safety. - - safety_results (list of tuples): A list of tuples containing the method, safety status, and safety report. - - prompt_type (str): The type of prompt (User/System). - """ - if are_safe_prompt: - print(f"{prompt_type} prompt deemed safe.") - print(f"{prompt_type} prompt:\n{prompt}") - else: - print(f"{prompt_type} prompt deemed unsafe.") - for method, is_safe, report in safety_results: - if not is_safe: - print(method) - print(report) - print(f"Skipping the inference as the {prompt_type.lower()} prompt is not safe.") - sys.exit(1) - - # Check user prompt - print_safety_results(are_safe_user_prompt, user_prompt, safety_results_user_prompt, "User") - - # Check system prompt - print_safety_results(are_safe_system_prompt, system_prompt, safety_results_system_prompt, "System") - -def main( - model_name, - peft_model: str=None, - quantization: bool=False, - max_new_tokens =100, #The maximum numbers of tokens to generate - seed: int=42, #seed value for reproducibility - do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise. - min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens - use_cache: bool=False, #[optional] Whether or not the model should use the past last key/values attentions Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding. - top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. - temperature: float=0.6, # [optional] The value used to modulate the next token probabilities. - top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering. - repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty. - length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation. 
- enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api - enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs - enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5 - enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard - use_fast_kernels: bool = True, # Enable using SDPA from PyTroch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels - **kwargs -): - system_prompt = input("Please insert your system prompt: ") - user_prompt = input("Please insert your prompt: ") - chat = [ - {"role": "system", "content": system_prompt}, - {"role": "user", "content": user_prompt}, - ] - # Set the seeds for reproducibility - torch.cuda.manual_seed(seed) - torch.manual_seed(seed) - - model = load_model(model_name, quantization, use_fast_kernels) - if peft_model: - model = load_peft_model(model, peft_model) - - model.eval() - - tokenizer = AutoTokenizer.from_pretrained(model_name) - safety_checker = get_safety_checker(enable_azure_content_safety, - enable_sensitive_topics, - enable_salesforce_content_safety, - enable_llamaguard_content_safety, - ) - - # Safety check of the user prompt - safety_results_user_prompt = [check(user_prompt) for check in safety_checker] - safety_results_system_prompt = [check(system_prompt) for check in safety_checker] - are_safe_user_prompt = all([r[1] for r in safety_results_user_prompt]) - are_safe_system_prompt = all([r[1] for r in safety_results_system_prompt]) - handle_safety_check(are_safe_user_prompt, user_prompt, safety_results_user_prompt, are_safe_system_prompt, system_prompt, safety_results_system_prompt) - - inputs = tokenizer.apply_chat_template(chat, return_tensors="pt").to("cuda") - - start = time.perf_counter() - with torch.no_grad(): - outputs = model.generate( - input_ids=inputs, - max_new_tokens=max_new_tokens, - do_sample=do_sample, - top_p=top_p, - temperature=temperature, - min_length=min_length, - use_cache=use_cache, - top_k=top_k, - repetition_penalty=repetition_penalty, - length_penalty=length_penalty, - **kwargs - ) - e2e_inference_time = (time.perf_counter()-start)*1000 - print(f"the inference time is {e2e_inference_time} ms") - output_text = tokenizer.decode(outputs[0], skip_special_tokens=True) - - # Safety check of the model output - safety_results = [check(output_text) for check in safety_checker] - are_safe = all([r[1] for r in safety_results]) - if are_safe: - print("User input and model output deemed safe.") - print(f"Model output:\n{output_text}") - else: - print("Model output deemed unsafe.") - for method, is_safe, report in safety_results: - if not is_safe: - print(method) - print(report) - - -if __name__ == "__main__": - fire.Fire(main) diff --git a/getting-started/inference/modelUpgradeExample.py b/getting-started/inference/modelUpgradeExample.py deleted file mode 100644 index f2fa19cd14eb14090be36e6a373c92c5fdea1e47..0000000000000000000000000000000000000000 --- a/getting-started/inference/modelUpgradeExample.py +++ /dev/null @@ -1,51 +0,0 @@ -# Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. - -# Running the script without any arguments "python modelUpgradeExample.py" performs inference with the Llama 3 8B Instruct model. 
-# Passing --model-id "meta-llama/Meta-Llama-3.1-8B-Instruct" to the script will switch it to using the Llama 3.1 version of the same model. -# The script also shows the input tokens to confirm that the models are responding to the same input - -import fire -from transformers import AutoTokenizer, AutoModelForCausalLM -import torch - -def main(model_id = "meta-llama/Meta-Llama-3-8B-Instruct"): - tokenizer = AutoTokenizer.from_pretrained(model_id) - model = AutoModelForCausalLM.from_pretrained( - model_id, - torch_dtype=torch.bfloat16, - device_map="auto", - ) - - messages = [ - {"role": "system", "content": "You are a helpful chatbot"}, - {"role": "user", "content": "Why is the sky blue?"}, - {"role": "assistant", "content": "Because the light is scattered"}, - {"role": "user", "content": "Please tell me more about that"}, - ] - - input_ids = tokenizer.apply_chat_template( - messages, - add_generation_prompt=True, - return_tensors="pt", - ).to(model.device) - - print("Input tokens:") - print(input_ids) - - attention_mask = torch.ones_like(input_ids) - outputs = model.generate( - input_ids, - max_new_tokens=400, - eos_token_id=tokenizer.eos_token_id, - do_sample=True, - temperature=0.6, - top_p=0.9, - attention_mask=attention_mask, - ) - response = outputs[0][input_ids.shape[-1]:] - print("\nOutput:\n") - print(tokenizer.decode(response, skip_special_tokens=True)) - -if __name__ == "__main__": - fire.Fire(main) \ No newline at end of file diff --git a/recipes/README.md b/recipes/README.md deleted file mode 100644 index 86d90b7e0d04dfe6f8452869fc454f5d2f5ea649..0000000000000000000000000000000000000000 --- a/recipes/README.md +++ /dev/null @@ -1,11 +0,0 @@ -## Llama-Recipes - -This folder contains examples organized by topic: - -| Subfolder | Description | -|---|---| -[quickstart](./quickstart)|The "Hello World" of using Llama, start here if you are new to using Llama -[use_cases](./use_cases)|Scripts showing common applications of Llama -[3p_integrations](./3p_integrations)|Partner-owned folder showing Llama usage along with third-party tools -[responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs -[experimental](./experimental)| Llama implementations of experimental LLM techniques