Unverified commit 6d449a85 authored by Suraj Subramanian, committed by GitHub

New folder structure (#1)


* Add new file org structure
* Add new notebooks to quickstart on Mac and via HF
* consolidate all images in a top-level folder
* Update main README
* Remove "news" section from main README
* rename HF Trainer finetuning notebook and add detail to README

Co-authored-by: Navyata Bawa <bnavyata@fb.com>
Co-authored-by: Hamid Shojanazeri <hamid.nazeri2010@gmail.com>
parent 85ea8691
Showing with 89 additions and 380 deletions
{
    "python.testing.unittestArgs": [
        "-v",
        "-s",
        "./tests",
        "-p",
        "test_*.py"
    ],
    "python.testing.pytestEnabled": false,
    "python.testing.unittestEnabled": true
}
# Llama 2 Demo Apps
This folder contains a series of Llama 2-powered apps:
* Quickstart Llama deployments and basic interactions with Llama:
    1. Run Llama on your Mac and ask Llama general questions
    2. Run Llama on Google Colab
    3. Run Llama on the cloud and ask Llama questions about unstructured data in a PDF
    4. Run Llama on-prem with vLLM and TGI
    5. Build a Llama chatbot with RAG (Retrieval Augmented Generation)
    6. Use the Azure Llama 2 API (Model-as-a-Service)
* Specialized Llama use cases:
    1. Ask Llama to summarize a video's content
    2. Ask Llama questions about structured data in a DB
    3. Ask Llama questions about live data on the web
    4. Build a Llama-enabled WhatsApp chatbot
    5. Build a Llama-enabled Messenger chatbot

We also show how to build a quick web UI for the Llama 2 demo apps using Streamlit and Gradio.
If you need a general understanding of GenAI, Llama 2, prompt engineering and RAG (Retrieval Augmented Generation), be sure to first check the [Getting to know Llama 2 notebook](https://github.com/facebookresearch/llama-recipes/blob/main/examples/Getting_to_know_Llama.ipynb) and its Meta Connect video [here](https://www.facebook.com/watch/?v=662153709222699).
More advanced Llama 2 demo apps will be coming soon.
## Setting Up Environment
The quickest way to test-run the notebook demo apps on your local machine is to create a Conda environment and start running the Jupyter notebooks as follows:
```
conda create -n llama-demo-apps python=3.8
conda activate llama-demo-apps
pip install jupyter
cd <your_work_folder>
git clone https://github.com/facebookresearch/llama-recipes
cd llama-recipes/demo-apps
jupyter notebook
```
You can also upload the notebooks to Google Colab.
## HelloLlama - Quickstart in Running Llama2 (Almost) Everywhere*
The first three demo apps show:
* how to run Llama2 locally on a Mac, in Google Colab, and in the cloud using Replicate;
* how to use [LangChain](https://github.com/langchain-ai/langchain), an open-source framework for building LLM apps, to ask Llama general questions in different ways;
* how to use LangChain to load a recent PDF doc (the Llama2 paper) and ask questions about it. This is the well-known RAG (Retrieval Augmented Generation) method, which lets an LLM such as Llama2 answer questions about data that was not publicly available when it was trained, or about your own data; RAG is also one way to mitigate LLM hallucination. A minimal sketch of this flow follows the list below;
* how to ask follow-up questions to Llama by sending previous questions and answers as context along with the new question, hence performing a multi-turn chat or conversation with Llama.
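As a minimal sketch of the PDF RAG flow used in these notebooks (assuming `langchain`, `llama-cpp-python`, `pypdf`, `sentence-transformers` and `faiss-cpu` are installed; the file paths, chunk sizes and question are illustrative only, not the exact notebook contents):
```
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

# Load the Llama2 paper and split it into chunks that fit Llama's context window
docs = PyPDFLoader("llama2.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed and index the chunks so relevant passages can be retrieved per question
db = FAISS.from_documents(chunks, HuggingFaceEmbeddings())

llm = LlamaCpp(model_path="<path-to-llama-gguf-file>", n_ctx=4096)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
print(qa.run("What is new in Llama 2 compared to Llama 1?"))
```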
### [Running Llama2 Locally on Mac](HelloLlamaLocal.ipynb)
To run Llama2 locally on a Mac using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), open the notebook `HelloLlamaLocal` and replace `<path-to-llama-gguf-file>` with the path either to your downloaded quantized model file [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or to the `ggml-model-q4_0.gguf` file built with the following commands:
```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python3 -m pip install -r requirements.txt
python convert.py <path_to_your_downloaded_llama-2-13b_model>
./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0
```
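Once you have a quantized GGUF file, a multi-turn chat session looks roughly like this (a sketch assuming `llama-cpp-python` is installed; the model path is a placeholder):
```
from llama_cpp import Llama

llm = Llama(model_path="ggml-model-q4_0.gguf", n_ctx=4096, verbose=False)

# First turn
messages = [{"role": "user", "content": "What is the capital of France?"}]
reply = llm.create_chat_completion(messages=messages)["choices"][0]["message"]
print(reply["content"])

# Follow-up turn: send the previous answer back as context (multi-turn chat)
messages += [reply, {"role": "user", "content": "What is it famous for?"}]
print(llm.create_chat_completion(messages=messages)["choices"][0]["message"]["content"])
```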
### Running Llama2 Hosted in the Cloud (using [Replicate](HelloLlamaCloud.ipynb) or [OctoAI](OctoAI_API_examples/HelloLlamaCloud.ipynb))
The HelloLlama cloud version uses LangChain with Llama2 hosted in the cloud on [Replicate](HelloLlamaCloud.ipynb) and [OctoAI](OctoAI_API_examples/HelloLlamaCloud.ipynb). The demo shows how to ask Llama general questions and follow up questions, and how to use LangChain to ask Llama2 questions about **unstructured** data stored in a PDF.
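As a sketch of the cloud setup (assuming `langchain` and `replicate` are installed; the model version hash is a placeholder you would copy from the model's Replicate page):
```
import os
from langchain.llms import Replicate

os.environ["REPLICATE_API_TOKEN"] = "<your replicate api token>"
llm = Replicate(
    model="meta/llama-2-13b-chat:<version-hash>",
    model_kwargs={"temperature": 0.75, "max_new_tokens": 500},
)
print(llm("Who wrote the book Innovator's Dilemma?"))
```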
**<a id="replicate_note">Note on using Replicate</a>**
To run some of the demo apps here, you'll first need to sign in to Replicate with your GitHub account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 Locally on Mac" above or "Running Llama2 in Google Colab" below.
**<a id="octoai_note">Note on using OctoAI</a>**
You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](OctoAI_API_examples/). You can sign into [OctoAI](https://octoai.cloud) with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences).
### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing)
To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google Drive. Note that on the free Colab T4 GPU, a call to Llama could take more than 20 minutes to return; running the notebook locally on an M1 MacBook Pro takes about 20 seconds.
## [Running Llama2 On-Prem with vLLM and TGI](llama-on-prem.md)
This tutorial shows how to use Llama 2 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 2 on-prem apps.
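For a quick taste of vLLM's offline batched inference, here is a sketch (assuming `vllm` is installed on a machine with a supported GPU and that you have access to the Llama 2 weights on Hugging Face):
```
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# generate() batches prompts and returns one RequestOutput per prompt
for output in llm.generate(["What is the capital of France?"], params):
    print(output.outputs[0].text)
```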
\* To run a quantized Llama2 model on iOS and Android, you can use the open source [MLC LLM](https://github.com/mlc-ai/mlc-llm) or [llama.cpp](https://github.com/ggerganov/llama.cpp). You can even make a Linux OS that boots to Llama2 ([repo](https://github.com/trholding/llama2.c)).
## VideoSummary: Ask Llama2 to Summarize a YouTube Video (using [Replicate](VideoSummary.ipynb) or [OctoAI](OctoAI_API_examples/VideoSummary.ipynb))
This demo app uses Llama2 to return a text summary of a YouTube video. It shows how to retrieve the caption of a YouTube video and how to ask Llama to summarize the content in four different ways, from the simplest naive way that works for short text, to more advanced methods using LangChain's map_reduce and refine chains to overcome Llama's 4096-token maximum input limit.
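A sketch of the map_reduce variant (assuming `langchain` and its YouTube loader dependencies such as `youtube-transcript-api` are installed; the video URL, model version hash and chunk sizes are placeholders):
```
from langchain.llms import Replicate
from langchain.document_loaders import YoutubeLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

llm = Replicate(model="meta/llama-2-13b-chat:<version-hash>")

# Fetch the video caption, then split it so each chunk fits within 4096 tokens
docs = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=<video-id>").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100).split_documents(docs)

# map_reduce summarizes each chunk, then summarizes the summaries;
# chain_type="refine" instead folds each chunk into a running summary
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(chunks))
```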
## [NBA2023-24](StructuredLlama.ipynb): Ask Llama2 about Structured Data
This demo app shows how to use LangChain and Llama2 to let users ask questions about **structured** data stored in a SQL DB. As the 2023-24 NBA season is around the corner, we use the NBA roster info saved in a SQLite DB to show you how to ask Llama2 questions about your favorite teams or players.
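A sketch of the idea (assuming `langchain` and `langchain-experimental` are installed; `nba_roster.db` stands in for the demo's SQLite DB and the model version hash is a placeholder):
```
from langchain.llms import Replicate
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

llm = Replicate(model="meta/llama-2-13b-chat:<version-hash>")
db = SQLDatabase.from_uri("sqlite:///nba_roster.db")

# The chain asks Llama to write SQL, runs it, and has Llama phrase the answer
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
print(db_chain.run("How many players are on the Golden State Warriors roster?"))
```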
## LiveData: Ask Llama2 about Live Data (using [Replicate](LiveData.ipynb) or [OctoAI](OctoAI_API_examples/LiveData.ipynb))
This demo app shows how to perform live data augmented generation tasks with Llama2 and [LlamaIndex](https://github.com/run-llama/llama_index), another leading open-source framework for building LLM apps: it uses the [You.com search API](https://documentation.you.com/quickstart) to get live search results and asks Llama2 about them.
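A sketch of the flow (assuming `llama-index`, `langchain` and `requests` are installed and `YDC_API_KEY` holds a You.com API key; the endpoint and response shape follow the You.com quickstart at the time of writing and may change):
```
import os
import requests
from langchain.llms import Replicate
from llama_index import VectorStoreIndex, ServiceContext, Document

# Get live search results from You.com
resp = requests.get(
    "https://api.ydc-index.io/search",
    params={"query": "2023-24 NBA standings"},
    headers={"X-API-Key": os.environ["YDC_API_KEY"]},
).json()
docs = [Document(text=str(hit)) for hit in resp.get("hits", [])]

# Index the results and ask Llama2 about them
llm = Replicate(model="meta/llama-2-13b-chat:<version-hash>")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
index = VectorStoreIndex.from_documents(docs, service_context=service_context)
print(index.as_query_engine().query("Which team leads the Western Conference?"))
```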
## [WhatsApp Chatbot](whatsapp_llama2.md): Building a Llama-enabled WhatsApp Chatbot
This step-by-step tutorial shows how to use the [WhatsApp Business API](https://developers.facebook.com/docs/whatsapp/cloud-api/overview) to build a Llama-enabled WhatsApp chatbot.
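At its core, the chatbot replies to incoming messages by POSTing to the Cloud API's messages endpoint; a sketch (the phone number ID, token and recipient are placeholders, and the Graph API version may differ for your app):
```
import requests

def send_whatsapp_message(phone_number_id: str, token: str, to: str, text: str) -> None:
    """Send a text reply (e.g., a Llama2-generated answer) via the WhatsApp Cloud API."""
    url = f"https://graph.facebook.com/v17.0/{phone_number_id}/messages"
    payload = {"messaging_product": "whatsapp", "to": to, "text": {"body": text}}
    resp = requests.post(url, json=payload, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
```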
## [Messenger Chatbot](messenger_llama2.md): Building a Llama-enabled Messenger Chatbot
This step-by-step tutorial shows how to use the [Messenger Platform](https://developers.facebook.com/docs/messenger-platform/overview) to build a Llama-enabled Messenger chatbot.
## Quick Web UI for Llama2 Chat
If you prefer to see Llama2 in action in a web UI instead of the notebooks above, you can try one of these two methods:
### Running [Streamlit](https://streamlit.io/) with Llama2
Open a Terminal and run the following commands:
```
pip install streamlit langchain replicate
git clone https://github.com/facebookresearch/llama-recipes
cd llama-recipes/demo-apps
```
Replace the `<your replicate api token>` in `streamlit_llama2.py` with your API token created [here](https://replicate.com/account/api-tokens) - for more info, see the note [above](#replicate_note).
Then run the command `streamlit run streamlit_llama2.py` and you'll see the following UI in your browser - enter a text question, click Submit, and see Llama2's answer:
![](llama2-streamlit.png)
![](llama2-streamlit2.png)
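For reference, a Streamlit app of this kind can be as small as the sketch below (not the exact contents of `streamlit_llama2.py`; the model version hash is a placeholder and `REPLICATE_API_TOKEN` is assumed to be set as described above):
```
import streamlit as st
from langchain.llms import Replicate

st.title("Llama2 Chat")
llm = Replicate(model="meta/llama-2-13b-chat:<version-hash>")

question = st.text_input("Enter your question:")
if st.button("Submit") and question:
    st.write(llm(question))  # display Llama2's answer in the page
```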
### Running [Gradio](https://www.gradio.app/) with Llama2 (using [Replicate](Llama2_Gradio.ipynb) or [OctoAI](OctoAI_API_examples/Llama2_Gradio.ipynb))
To see how to query Llama2 and get answers with the Gradio UI, from both the notebook and the web, just launch the notebook `Llama2_Gradio.ipynb`. For more info on how to get set up with a token to power these apps, see the notes on [Replicate](#replicate_note) and [OctoAI](#octoai_note).
Then enter your question and click Submit. You'll see the following UI in the notebook, or in a browser at http://127.0.0.1:7860:
![](llama2-gradio.png)
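The Gradio wiring is similarly small; a sketch (assuming `gradio`, `langchain` and `replicate` are installed, with the same placeholder model version as above):
```
import gradio as gr
from langchain.llms import Replicate

llm = Replicate(model="meta/llama-2-13b-chat:<version-hash>")

def answer(question: str) -> str:
    return llm(question)

# launch() renders the UI inline in a notebook and serves it at 127.0.0.1:7860
gr.Interface(fn=answer, inputs="text", outputs="text").launch()
```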
### RAG Chatbot Example (running [locally](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb) or on [OctoAI](OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb))
A complete example of how to build a Llama 2 chatbot, hosted in your browser, that can answer questions about your own data using retrieval augmented generation (RAG). You can run Llama2 locally if you have a good enough GPU, or on OctoAI if you follow the note [above](#octoai_note).
### [Azure API Llama 2 Example](Azure_API_example/azure_api_example.ipynb)
A notebook showing examples of how to use the Llama 2 APIs offered by Microsoft Azure Model-as-a-Service from the CLI, Python, and LangChain, plus a Gradio chatbot example with memory.
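As a rough sketch of what a raw REST call to such an endpoint looks like (the URL, header and payload shapes here are assumptions based on the sample notebook and may differ for your deployment; see the notebook for the authoritative version):
```
import requests

url = "https://<your-endpoint>.inference.ai.azure.com/v1/chat/completions"
headers = {"Authorization": "Bearer <your-api-key>", "Content-Type": "application/json"}
payload = {
    "messages": [{"role": "user", "content": "Who wrote the book Innovator's Dilemma?"}],
    "max_tokens": 256,
    "temperature": 0.8,
}
print(requests.post(url, json=payload, headers=headers).json())
```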
@@ -16,7 +16,7 @@ Here we discuss frequently asked questions that may occur and we found useful al
4. Can I add custom datasets?
-Yes, you can find more information on how to do that [here](Dataset.md).
+Yes, you can find more information on how to do that [here](../recipes/finetuning/datasets/README.md).
5. What are the hardware SKU requirements for deploying these models?
File moved
File moved
File moved
File moved
@@ -9,7 +9,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:
Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
## Requirements
-To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`examples/finetuning.py`](../examples/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
+To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
**Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
# Examples
This folder contains finetuning and inference examples for Llama 2, Code Llama and [Purple Llama](https://ai.meta.com/llama/purple-llama/). For the full documentation on these examples, please refer to [docs/inference.md](../docs/inference.md).
## Finetuning
Please refer to the main [README.md](../README.md) for information on how to use the [finetuning.py](./finetuning.py) script.
After installing the llama-recipes package through [pip](../README.md#installation) you can also invoke the finetuning in two ways:
```
python -m llama_recipes.finetuning <parameters>
python examples/finetuning.py <parameters>
```
Please see [README.md](../README.md) for details.
## Inference
So far, we have provided the following inference examples:
1. The [inference script](./inference.py) provides support for Hugging Face accelerate, PEFT and FSDP fine-tuned models. It also demonstrates safety features to protect the user from toxic or harmful content.
2. The [vllm/inference.py](./vllm/inference.py) script takes advantage of vLLM's paged attention for low-latency inference.
3. The [hf_text_generation_inference](./hf_text_generation_inference/README.md) folder contains information on Hugging Face Text Generation Inference (TGI).
4. A [chat completion](./chat_completion/chat_completion.py) example highlighting the handling of chat dialogs.
5. The [Code Llama](./code_llama/) folder provides examples for [code completion](./code_llama/code_completion_example.py), [code infilling](./code_llama/code_infilling_example.py) and [Llama2 70B code instruct](./code_llama/code_instruct_example.py).
6. The [Purple Llama Using Anyscale](./Purple_Llama_Anyscale.ipynb) and [Purple Llama Using OctoAI](./Purple_Llama_OctoAI.ipynb) notebooks show how to use the Llama Guard model on Anyscale and OctoAI to classify user inputs as safe or unsafe.
7. The [Llama Guard](./llama_guard/) inference example and the [safety_checker](../src/llama_recipes/inference/safety_utils.py) for the main [inference](./inference.py) script. The standalone script lets you test Llama Guard on user input, or on user input and agent response pairs. The safety_checker integration provides a way to run Llama Guard on every inference execution, for both the user input and the model output.
For more in depth information on inference including inference safety checks and examples, see the inference documentation [here](../docs/inference.md).
**Note** The [sensitive topics safety checker](../src/llama_recipes/inference/safety_utils.py) utilizes AuditNLG, which is an optional dependency. Please refer to the installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.
**Note** The **vLLM** example requires additional dependencies. Please refer to the installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details.
## Train on custom dataset
To show how to train a model on a custom dataset, we provide an example that generates a custom dataset in [custom_dataset.py](./custom_dataset.py).
The usage of custom datasets is further described in the datasets [README](../docs/Dataset.md#training-on-custom-data); a minimal sketch of the expected interface is shown below.
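A minimal sketch of that interface, assuming the contract described in the datasets README (llama-recipes imports your file and calls `get_custom_dataset(dataset_config, tokenizer, split)`); the data below is illustrative:
```
import datasets

def get_custom_dataset(dataset_config, tokenizer, split):
    # Replace this toy data with your own source (local files, HF datasets, etc.)
    raw = datasets.Dataset.from_dict({"text": ["<your training sample>"]})

    def tokenize(sample):
        return tokenizer(sample["text"])

    return raw.map(tokenize, remove_columns=["text"])
```
You would then point the finetuning script at this file via the custom dataset options described in the README above.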
This folder contains examples organized by topic:
| Subfolder | Description |
|---|---|
| [quickstart](./quickstart) | The "Hello World" of using Llama2, start here if you are new to using Llama2. |
| [finetuning](./finetuning) | Scripts to finetune Llama2 on single-GPU and multi-GPU setups |
| [inference](./inference) | Scripts to deploy Llama2 for inference locally and using model servers |
| [use_cases](./use_cases) | Scripts showing common applications of Llama2 |
| [responsible_ai](./responsible_ai) | Scripts to use PurpleLlama for safeguarding model outputs |
| [llama_api_providers](./llama_api_providers) | Scripts to run inference on Llama via hosted endpoints |
| [benchmarks](./benchmarks) | Scripts to benchmark inference of Llama 2 models on various backends |
| [code_llama](./code_llama) | Scripts to run inference with the Code Llama models |
| [evaluation](./evaluation) | Scripts to evaluate fine-tuned Llama2 models using `lm-evaluation-harness` from `EleutherAI` |
**<a id="replicate_note">Note on using Replicate</a>**
To run some of the demo apps here, you'll need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 locally on Mac" above or the "Running Llama2 in Google Colab" below.
**<a id="octoai_note">Note on using OctoAI</a>**
You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](./llama_api_providers/OctoAI_API_examples/). You can sign into [OctoAI](https://octoai.cloud) with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences).
### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing)
To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google drive. Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on M1 MBP takes about 20 seconds.
\ No newline at end of file