diff --git a/.vscode/settings.json b/.vscode/settings.json deleted file mode 100644 index e9e6a805e7a729113356b60277e0afe47416e436..0000000000000000000000000000000000000000 --- a/.vscode/settings.json +++ /dev/null @@ -1,11 +0,0 @@ -{ - "python.testing.unittestArgs": [ - "-v", - "-s", - "./tests", - "-p", - "test_*.py" - ], - "python.testing.pytestEnabled": false, - "python.testing.unittestEnabled": true -} \ No newline at end of file diff --git a/README.md b/README.md index c418e186a314ce55079a4e2639f4f672dbd68145..3cbb8d96b05120e8ec985fa9424d78d05e9e906f 100644 --- a/README.md +++ b/README.md @@ -1,45 +1,49 @@ -# Llama 2 Fine-tuning / Inference Recipes, Examples, Benchmarks and Demo Apps +# Llama Recipes: Examples to get started using the Llama models from Meta -**[Update Feb. 26, 2024] We added examples to showcase OctoAI's cloud APIs for Llama2, CodeLlama, and LlamaGuard: including [PurpleLlama overview](./examples/Purple_Llama_OctoAI.ipynb), [hello Llama2 cloud](./demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb), [getting to know Llama2](./demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb), [live search example](./demo_apps/OctoAI_API_examples/LiveData.ipynb), [Llama2 Gradio demo](./demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb), [Youtube video summarization](./demo_apps/OctoAI_API_examples/VideoSummary.ipynb), and [retrieval augmented generation overview](./demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)**. +The 'llama-recipes' repository is a companion to the [Llama 2 model](https://github.com/facebookresearch/llama). The goal of this repository is to provide a scalable library for fine-tuning Llama 2, along with some example scripts and notebooks to quickly get started with using the Llama 2 models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Llama 2 and other tools in the LLM ecosystem. The examples here showcase how to run Llama 2 locally, in the cloud, and on-prem. -**[Update Feb. 5, 2024] We added support for Code Llama 70B instruct in our example [inference script](./examples/code_llama/code_instruct_example.py). For details on formatting the prompt for Code Llama 70B instruct model please refer to [this document](./docs/inference.md)**. +> [!NOTE] +> The llama-recipes repository was recently refactored to promote a better developer experience of using the examples. Some files have been moved to new locations. The `src/` folder has NOT been modified, so the functionality of this repo and package is not impacted. +> +> Make sure you update your local clone by running `git pull origin main` -**[Update Dec. 28, 2023] We added support for Llama Guard as a safety checker for our example inference script and also with standalone inference with an example script and prompt formatting. More details [here](./examples/llama_guard/README.md). For details on formatting data for fine tuning Llama Guard, we provide a script and sample usage [here](./src/llama_recipes/data/llama_guard/README.md).** +## Table of Contents -**[Update Dec 14, 2023] We recently released a series of Llama 2 demo apps [here](./demo_apps). 
These apps show how to run Llama (locally, in the cloud, or on-prem), how to use Azure Llama 2 API (Model-as-a-Service), how to ask Llama questions in general or about custom data (PDF, DB, or live), how to integrate Llama with WhatsApp and Messenger, and how to implement an end-to-end chatbot with RAG (Retrieval Augmented Generation).** +- [Llama Recipes: Examples to get started using the Llama models from Meta](#llama-recipes-examples-to-get-started-using-the-llama-models-from-meta) + - [Table of Contents](#table-of-contents) + - [Getting Started](#getting-started) + - [Prerequisites](#prerequisites) + - [PyTorch Nightlies](#pytorch-nightlies) + - [Installing](#installing) + - [Install with pip](#install-with-pip) + - [Install with optional dependencies](#install-with-optional-dependencies) + - [Install from source](#install-from-source) + - [Getting the Llama models](#getting-the-llama-models) + - [Model conversion to Hugging Face](#model-conversion-to-hugging-face) + - [Repository Organization](#repository-organization) + - [`recipes/`](#recipes) + - [`src/`](#src) + - [Contributing](#contributing) + - [License](#license) -The 'llama-recipes' repository is a companion to the [Llama 2 model](https://github.com/facebookresearch/llama). The goal of this repository is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models. See steps for conversion of the model [here](#model-conversion-to-hugging-face). +## Getting Started -In addition, we also provide a number of demo apps, to showcase the Llama 2 usage along with other ecosystem solutions to run Llama 2 locally, in the cloud, and on-prem. +These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system. -Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the [Responsible Use Guide](https://github.com/facebookresearch/llama/blob/main/Responsible-Use-Guide.pdf). More details can be found in our research paper as well. For downloading the models, follow the instructions on [Llama 2 repo](https://github.com/facebookresearch/llama). +### Prerequisites +#### PyTorch Nightlies +Some features (especially fine-tuning with FSDP + PEFT) currently require PyTorch nightlies to be installed. Please make sure to install the nightlies if you're using these features following [this guide](https://pytorch.org/get-started/locally/). -# Table of Contents -1. [Quick start](#quick-start) -2. [Model Conversion](#model-conversion-to-hugging-face) -3. [Fine-tuning](#fine-tuning) - - [Single GPU](#single-gpu) - - [Multi GPU One Node](#multiple-gpus-one-node) - - [Multi GPU Multi Node](#multi-gpu-multi-node) -4. [Inference](./docs/inference.md) -5. [Demo Apps](#demo-apps) -6. [Repository Organization](#repository-organization) -7. [License and Acceptable Use Policy](#license) - -# Quick Start - -[Llama 2 Jupyter Notebook](./examples/quickstart.ipynb): This jupyter notebook steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum). 
The notebook uses parameter efficient finetuning (PEFT) and int8 quantization to finetune a 7B on a single GPU like an A10 with 24GB gpu memory. - -# Installation +### Installing Llama-recipes provides a pip distribution for easy install and usage in other projects. Alternatively, it can be installed from source. -## Install with pip +#### Install with pip ``` pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes ``` -## Install with optional dependencies +#### Install with optional dependencies Llama-recipes offers the installation of optional packages. There are three optional dependency groups. To run the unit tests we can install the required dependencies with: ``` @@ -55,7 +59,7 @@ pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama- ``` Optional dependencies can also be combines with [option1,option2]. -## Install from source +#### Install from source To install from source e.g. for development use these commands. We're using hatchling as our build backend which requires an up-to-date pip as well as setuptools package. ``` git clone git@github.com:facebookresearch/llama-recipes.git @@ -71,25 +75,11 @@ pip install -U pip setuptools pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .[tests,auditnlg,vllm] ``` -âš ï¸ **Note** âš ï¸ Some features (especially fine-tuning with FSDP + PEFT) currently require PyTorch nightlies to be installed. Please make sure to install the nightlies if you're using these features following [this guide](https://pytorch.org/get-started/locally/). - -**Note** All the setting defined in [config files](src/llama_recipes/configs/) can be passed as args through CLI when running the script, there is no need to change from config files directly. - -**For more in depth information checkout the following:** - -* [Single GPU Fine-tuning](./docs/single_gpu.md) -* [Multi-GPU Fine-tuning](./docs/multi_gpu.md) -* [LLM Fine-tuning](./docs/LLM_finetuning.md) -* [Adding custom datasets](./docs/Dataset.md) -* [Inference](./docs/inference.md) -* [Evaluation Harness](./eval/README.md) -* [FAQs](./docs/FAQ.md) -# Where to find the models? +### Getting the Llama models +You can find Llama 2 models on Hugging Face hub [here](https://huggingface.co/meta-llama), **where models with `hf` in the name are already converted to Hugging Face checkpoints so no further conversion is needed**. The conversion step below is only for original model weights from Meta that are hosted on Hugging Face model hub as well. -You can find Llama 2 models on Hugging Face hub [here](https://huggingface.co/meta-llama), where models with `hf` in the name are already converted to Hugging Face checkpoints so no further conversion is needed. The conversion step below is only for original model weights from Meta that are hosted on Hugging Face model hub as well. - -# Model conversion to Hugging Face +#### Model conversion to Hugging Face The recipes and notebooks in this folder are using the Llama 2 model definition provided by Hugging Face's transformers library. 
Given that the original checkpoint resides under models/7B you can install all requirements and convert the checkpoint with: @@ -105,174 +95,43 @@ python src/transformers/models/llama/convert_llama_weights_to_hf.py \ --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path ``` -# Fine-tuning - -For fine-tuning Llama 2 models for your domain-specific use cases recipes for PEFT, FSDP, PEFT+FSDP have been included along with a few test datasets. For details see [LLM Fine-tuning](./docs/LLM_finetuning.md). - -## Single and Multi GPU Finetune - -If you want to dive right into single or multi GPU fine-tuning, run the examples below on a single GPU like A10, T4, V100, A100 etc. -All the parameters in the examples and recipes below need to be further tuned to have desired results based on the model, method, data and task at hand. - -**Note:** -* To change the dataset in the commands below pass the `dataset` arg. Current options for integrated dataset are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](./examples/custom_dataset.py). A description of how to use your own dataset and how to add custom datasets can be found in [Dataset.md](./docs/Dataset.md#using-custom-datasets). For `grammar_dataset`, `alpaca_dataset` please make sure you use the suggested instructions from [here](./docs/single_gpu.md#how-to-run-with-different-datasets) to set them up. - -* Default dataset and other LORA config has been set to `samsum_dataset`. - -* Make sure to set the right path to the model in the [training config](src/llama_recipes/configs/training.py). - -* To save the loss and perplexity metrics for evaluation, enable this by passing `--save_metrics` to the finetuning script. The file can be plotted using the [plot_metrics.py](./examples/plot_metrics.py) script, `python examples/plot_metrics.py --file_path path/to/metrics.json` - -### Single GPU: - -```bash -#if running on multi-gpu machine -export CUDA_VISIBLE_DEVICES=0 - -python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /path_of_model_folder/7B --output_dir path/to/save/PEFT/model - -``` - -Here we make use of Parameter Efficient Methods (PEFT) as described in the next section. To run the command above make sure to pass the `peft_method` arg which can be set to `lora`, `llama_adapter` or `prefix`. - -**Note** if you are running on a machine with multiple GPUs please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id` - -**Make sure you set `save_model` parameter to save the model. Be sure to check the other training parameter in [train config](src/llama_recipes/configs/training.py) as well as others in the config folder as needed. All parameter can be passed as args to the training script. No need to alter the config files.** - - -### Multiple GPUs One Node: - -**NOTE** please make sure to use PyTorch Nightlies for using PEFT+FSDP. Also, note that int8 quantization from bit&bytes currently is not supported in FSDP. - -```bash - -torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_of_model_folder/7B --fsdp_config.pure_bf16 --output_dir path/to/save/PEFT/model - -``` - -Here we use FSDP as discussed in the next section which can be used along with PEFT methods. To make use of PEFT methods with FSDP make sure to pass `use_peft` and `peft_method` args along with `enable_fsdp`. 
Here we are using `BF16` for training. - -## Flash Attention and Xformer Memory Efficient Kernels - -Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up the fine-tuning job. This has been enabled in `optimum` library from Hugging Face as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/). - -```bash -torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_of_model_folder/7B --fsdp_config.pure_bf16 --output_dir path/to/save/PEFT/model --use_fast_kernels -``` - -### Fine-tuning using FSDP Only - -If you are interested in running full parameter fine-tuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs. - -```bash - -torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --use_fast_kernels - -``` - -### Fine-tuning using FSDP on 70B Model -If you are interested in running full parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode as the following command. This option will load model on rank0 only before moving model to devices to construct FSDP. This can dramatically save cpu memory when loading large models like 70B (on a 8-gpu node, this reduces cpu memory from 2+T to 280G for 70B model). This has been tested with `BF16` on 16xA100, 80GB GPUs. -```bash +## Repository Organization +Most of the code dealing with Llama usage is organized across 2 main folders: `recipes/` and `src/`. -torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned +### `recipes/` -``` +Contains examples are organized in folders by topic: +| Subfolder | Description | +|---|---| +[quickstart](./recipes/quickstart) | The "Hello World" of using Llama2, start here if you are new to using Llama2. +[finetuning](./recipes/finetuning)|Scripts to finetune Llama2 on single-GPU and multi-GPU setups +[inference](./recipes/inference)|Scripts to deploy Llama2 for inference locally and using model servers +[use_cases](./recipes/use_cases)|Scripts showing common applications of Llama2 +[responsible_ai](./recipes/responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs +[llama_api_providers](./recipes/llama_api_providers)|Scripts to run inference on Llama via hosted endpoints +[benchmarks](./recipes/benchmarks)|Scripts to benchmark Llama 2 models inference on various backends +[code_llama](./recipes/code_llama)|Scripts to run inference with the Code Llama models +[evaluation](./recipes/evaluation)|Scripts to evaluate fine-tuned Llama2 models using `lm-evaluation-harness` from `EleutherAI` -In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag. 
+### `src/` -HSDP (Hybrid sharding Data Parallel) helps to define a hybrid sharding strategy where you can have FSDP within `sharding_group_size` which can be the minimum number of GPUs you can fit your model and DDP between the replicas of the model specified by `replica_group_size`. +Contains modules which support the example recipes: +| Subfolder | Description | +|---|---| +| [configs](src/llama_recipes/configs/) | Contains the configuration files for PEFT methods, FSDP, Datasets, Weights & Biases experiment tracking. | +| [datasets](src/llama_recipes/datasets/) | Contains individual scripts for each dataset to download and process. Note | +| [inference](src/llama_recipes/inference/) | Includes modules for inference for the fine-tuned models. | +| [model_checkpointing](src/llama_recipes/model_checkpointing/) | Contains FSDP checkpoint handlers. | +| [policies](src/llama_recipes/policies/) | Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing along with any precision optimizer (used for running FSDP with pure bf16 mode). | +| [utils](src/llama_recipes/utils/) | Utility files for:<br/> - `train_utils.py` provides training/eval loop and more train utils.<br/> - `dataset_utils.py` to get preprocessed datasets.<br/> - `config_utils.py` to override the configs received from CLI.<br/> - `fsdp_utils.py` provides FSDP wrapping policy for PEFT methods.<br/> - `memory_utils.py` context manager to track different memory stats in train loop. | -This will require to set the Sharding strategy in [fsdp config](./src/llama_recipes/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specify two additional settings, `sharding_group_size` and `replica_group_size` where former specifies the sharding group size, number of GPUs that you model can fit into to form a replica of a model and latter specifies the replica group size, which is world_size/sharding_group_size. +## Contributing -```bash - -torchrun --nnodes 4 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n - -``` - -### Multi GPU Multi Node: - -```bash - -sbatch multi_node.slurm -# Change the num nodes and GPU per nodes in the script before running. - -``` -You can read more about our fine-tuning strategies [here](./docs/LLM_finetuning.md). - -## Weights & Biases Experiment Tracking - -You can enable [W&B](https://wandb.ai/) experiment tracking by using `use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`. - -```bash -python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model --use_wandb -``` -You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below. -<div style="display: flex;"> - <img src="./docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" /> -</div> - -# Evaluation Harness +Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us. -Here, we make use `lm-evaluation-harness` from `EleutherAI` for evaluation of fine-tuned Llama 2 models. 
This also can extend to evaluate other optimizations for inference of Llama 2 model such as quantization. Please use this get started [doc](./eval/README.md). - -# Demo Apps -This folder contains a series of Llama2-powered apps: -* Quickstart Llama deployments and basic interactions with Llama -1. Llama on your Mac and ask Llama general questions -2. Llama on Google Colab -3. Llama on Cloud and ask Llama questions about unstructured data in a PDF -4. Llama on-prem with vLLM and TGI -5. Llama chatbot with RAG (Retrieval Augmented Generation) -6. Azure Llama 2 API (Model-as-a-Service) - -* Specialized Llama use cases: -1. Ask Llama to summarize a video content -2. Ask Llama questions about structured data in a DB -3. Ask Llama questions about live data on the web -4. Build a Llama-enabled WhatsApp chatbot - -# Benchmarks -This folder contains a series of benchmark scripts for Llama 2 models inference on various backends: -1. On-prem - Popular serving frameworks and containers (i.e. vLLM) -2. (WIP) Cloud API - Popular API services (i.e. Azure Model-as-a-Service) -3. (WIP) On-device - Popular on-device inference solutions on Android and iOS (i.e. mlc-llm, QNN) -4. (WIP) Optimization - Popular optimization solutions for faster inference and quantization (i.e. AutoAWQ) - -# Repository Organization -This repository is organized in the following way: -[benchmarks](./benchmarks): Contains a series of benchmark scripts for Llama 2 models inference on various backends. - -[configs](src/llama_recipes/configs/): Contains the configuration files for PEFT methods, FSDP, Datasets, Weights & Biases experiment tracking. - -[docs](docs/): Example recipes for single and multi-gpu fine-tuning recipes. - -[datasets](src/llama_recipes/datasets/): Contains individual scripts for each dataset to download and process. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses) - -[demo_apps](./demo_apps): Contains a series of Llama2-powered apps, from quickstart deployments to how to ask Llama questions about unstructured data, structured data, live data, and video summary. - -[examples](./examples/): Contains examples script for finetuning and inference of the Llama 2 model as well as how to use them safely. - -[inference](src/llama_recipes/inference/): Includes modules for inference for the fine-tuned models. - -[model_checkpointing](src/llama_recipes/model_checkpointing/): Contains FSDP checkpoint handlers. - -[policies](src/llama_recipes/policies/): Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing along with any precision optimizer (used for running FSDP with pure bf16 mode). - -[utils](src/llama_recipes/utils/): Utility files for: - -- `train_utils.py` provides training/eval loop and more train utils. - -- `dataset_utils.py` to get preprocessed datasets. - -- `config_utils.py` to override the configs received from CLI. - -- `fsdp_utils.py` provides FSDP wrapping policy for PEFT methods. - -- `memory_utils.py` context manager to track different memory stats in train loop. 
- -# License +## License See the License file [here](LICENSE) and Acceptable Use Policy [here](USE_POLICY.md) + diff --git a/demo_apps/README.md b/demo_apps/README.md deleted file mode 100644 index e4f59252d9ac82a04a55b86b74cda7ffde2cac1c..0000000000000000000000000000000000000000 --- a/demo_apps/README.md +++ /dev/null @@ -1,121 +0,0 @@ -# Llama 2 Demo Apps - -This folder contains a series of Llama 2-powered apps: -* Quickstart Llama deployments and basic interactions with Llama -1. Llama on your Mac and ask Llama general questions -2. Llama on Google Colab -3. Llama on Cloud and ask Llama questions about unstructured data in a PDF -4. Llama on-prem with vLLM and TGI -5. Llama chatbot with RAG (Retrieval Augmented Generation) -6. Azure Llama 2 API (Model-as-a-Service) - -* Specialized Llama use cases: -1. Ask Llama to summarize a video content -2. Ask Llama questions about structured data in a DB -3. Ask Llama questions about live data on the web -4. Build a Llama-enabled WhatsApp chatbot -5. Build a Llama-enabled Messenger chatbot - -We also show how to build quick web UI for Llama 2 demo apps using Streamlit and Gradio. - -If you need a general understanding of GenAI, Llama 2, prompt engineering and RAG (Retrieval Augmented Generation), be sure to first check the [Getting to know Llama 2 notebook](https://github.com/facebookresearch/llama-recipes/blob/main/examples/Getting_to_know_Llama.ipynb) and its Meta Connect video [here](https://www.facebook.com/watch/?v=662153709222699). - -More advanced Llama 2 demo apps will be coming soon. - -## Setting Up Environment - -The quickest way to test run the notebook demo apps on your local machine is to create a Conda envinronment and start running the Jupyter notebook as follows: -``` -conda create -n llama-demo-apps python=3.8 -conda activate llama-demo-apps -pip install jupyter -cd <your_work_folder> -git clone https://github.com/facebookresearch/llama-recipes -cd llama-recipes/demo-apps -jupyter notebook -``` - -You can also upload the notebooks to Google Colab. - -## HelloLlama - Quickstart in Running Llama2 (Almost) Everywhere* - -The first three demo apps show: -* how to run Llama2 locally on a Mac, in the Google Colab notebook, and in the cloud using Replicate; -* how to use [LangChain](https://github.com/langchain-ai/langchain), an open-source framework for building LLM apps, to ask Llama general questions in different ways; -* how to use LangChain to load a recent PDF doc - the Llama2 paper pdf - and ask questions about it. This is the well known RAG method to let LLM such as Llama2 be able to answer questions about the data not publicly available when Llama2 was trained, or about your own data. RAG is one way to prevent LLM's hallucination. -* how to ask follow up questions to Llama by sending previous questions and answers as the context along with the new question, hence performing multi-turn chat or conversation with Llama. - -### [Running Llama2 Locally on Mac](HelloLlamaLocal.ipynb) -To run Llama2 locally on Mac using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), first open the notebook `HelloLlamaLocal`. 
Then replace `<path-to-llama-gguf-file>` in the notebook `HelloLlamaLocal` with the path either to your downloaded quantized model file [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or to the `ggml-model-q4_0.gguf` file built with the following commands: -``` -git clone https://github.com/ggerganov/llama.cpp -cd llama.cpp -python3 -m pip install -r requirements.txt -python convert.py <path_to_your_downloaded_llama-2-13b_model> -./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0 -``` - -### Running Llama2 Hosted in the Cloud (using [Replicate](HelloLlamaCloud.ipynb) or [OctoAI](OctoAI_API_examples/HelloLlamaCloud.ipynb)) - -The HelloLlama cloud version uses LangChain with Llama2 hosted in the cloud on [Replicate](HelloLlamaCloud.ipynb) and [OctoAI](OctoAI_API_examples/HelloLlamaCloud.ipynb). The demo shows how to ask Llama general questions and follow up questions, and how to use LangChain to ask Llama2 questions about **unstructured** data stored in a PDF. - -**<a id="replicate_note">Note on using Replicate</a>** -To run some of the demo apps here, you'll need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 locally on Mac" above or the "Running Llama2 in Google Colab" below. - -**<a id="octoai_note">Note on using OctoAI</a>** -You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](OctoAI_API_examples/). You can sign into [OctoAI](https://octoai.cloud) with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences). - -### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing) -To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google drive. Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on M1 MBP takes about 20 seconds. - -## [Running Llama2 On-Prem with vLLM and TGI](llama-on-prem.md) -This tutorial shows how to use Llama 2 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 2 on-prem apps. 
- -\* To run a quantized Llama2 model on iOS and Android, you can use the open source [MLC LLM](https://github.com/mlc-ai/mlc-llm) or [llama.cpp](https://github.com/ggerganov/llama.cpp). You can even make a Linux OS that boots to Llama2 ([repo](https://github.com/trholding/llama2.c)). - -## VideoSummary: Ask Llama2 to Summarize a YouTube Video (using [Replicate](VideoSummary.ipynb) or [OctoAI](OctoAI_API_examples/VideoSummary.ipynb)) -This demo app uses Llama2 to return a text summary of a YouTube video. It shows how to retrieve the caption of a YouTube video and how to ask Llama to summarize the content in four different ways, from the simplest naive way that works for short text to more advanced methods of using LangChain's map_reduce and refine to overcome the 4096 limit of Llama's max input token size. - -## [NBA2023-24](StructuredLlama.ipynb): Ask Llama2 about Structured Data -This demo app shows how to use LangChain and Llama2 to let users ask questions about **structured** data stored in a SQL DB. As the 2023-24 NBA season is around the corner, we use the NBA roster info saved in a SQLite DB to show you how to ask Llama2 questions about your favorite teams or players. - -## LiveData: Ask Llama2 about Live Data (using [Replicate](LiveData.ipynb) or [OctoAI](OctoAI_API_examples/LiveData.ipynb)) -This demo app shows how to perform live data augmented generation tasks with Llama2 and [LlamaIndex](https://github.com/run-llama/llama_index), another leading open-source framework for building LLM apps: it uses the [You.com search API](https://documentation.you.com/quickstart) to get live search result and ask Llama2 about them. - -## [WhatsApp Chatbot](whatsapp_llama2.md): Building a Llama-enabled WhatsApp Chatbot -This step-by-step tutorial shows how to use the [WhatsApp Business API](https://developers.facebook.com/docs/whatsapp/cloud-api/overview) to build a Llama-enabled WhatsApp chatbot. - -## [Messenger Chatbot](messenger_llama2.md): Building a Llama-enabled Messenger Chatbot -This step-by-step tutorial shows how to use the [Messenger Platform](https://developers.facebook.com/docs/messenger-platform/overview) to build a Llama-enabled Messenger chatbot. - -## Quick Web UI for Llama2 Chat -If you prefer to see Llama2 in action in a web UI, instead of the notebooks above, you can try one of the two methods: - -### Running [Streamlit](https://streamlit.io/) with Llama2 -Open a Terminal, run the following commands: -``` -pip install streamlit langchain replicate -git clone https://github.com/facebookresearch/llama-recipes -cd llama-recipes/llama-demo-apps -``` - -Replace the `<your replicate api token>` in `streamlit_llama2.py` with your API token created [here](https://replicate.com/account/api-tokens) - for more info, see the note [above](#replicate_note). - -Then run the command `streamlit run streamlit_llama2.py` and you'll see on your browser the following UI with question and answer - you can enter new text question, click Submit, and see Llama2's answer: - - - - -### Running [Gradio](https://www.gradio.app/) with Llama2 (using [Replicate](Llama2_Gradio.ipynb) or [OctoAI](OctoAI_API_examples/Llama2_Gradio.ipynb)) - -To see how to query Llama2 and get answers with the Gradio UI both from the notebook and web, just launch the notebook `Llama2_Gradio.ipynb`. For more info, on how to get set up with a token to power these apps, see the note on [Replicate](#replicate_note) and [OctoAI](#octoai_note). - -Then enter your question, click Submit. 
You'll see in the notebook or a browser with URL http://127.0.0.1:7860 the following UI: - - - -### RAG Chatbot Example (running [locally](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb) or on [OctoAI](OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)) -A complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data using retrieval augmented generation (RAG). You can run Llama2 locally if you have a good enough GPU or on OctoAI if you follow the note [above](#octoai_note). - -### [Azure API Llama 2 Example](Azure_API_example/azure_api_example.ipynb) -A notebook shows examples of how to use Llama 2 APIs offered by Microsoft Azure Model-as-a-Service in CLI, Python, LangChain and a Gradio chatbot example with memory. diff --git a/docs/FAQ.md b/docs/FAQ.md index 8c2e12a7cf8e778ae5c61193d7d414056114f100..4229bedf8e511712698252515e619744e1d88c5d 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -16,7 +16,7 @@ Here we discuss frequently asked questions that may occur and we found useful al 4. Can I add custom datasets? - Yes, you can find more information on how to do that [here](Dataset.md). + Yes, you can find more information on how to do that [here](../recipes/finetuning/datasets/README.md). 5. What are the hardware SKU requirements for deploying these models? diff --git a/demo_apps/llama2-gradio.png b/docs/images/llama2-gradio.png similarity index 100% rename from demo_apps/llama2-gradio.png rename to docs/images/llama2-gradio.png diff --git a/demo_apps/llama2-streamlit.png b/docs/images/llama2-streamlit.png similarity index 100% rename from demo_apps/llama2-streamlit.png rename to docs/images/llama2-streamlit.png diff --git a/demo_apps/llama2-streamlit2.png b/docs/images/llama2-streamlit2.png similarity index 100% rename from demo_apps/llama2-streamlit2.png rename to docs/images/llama2-streamlit2.png diff --git a/demo_apps/messenger_api_settings.png b/docs/images/messenger_api_settings.png similarity index 100% rename from demo_apps/messenger_api_settings.png rename to docs/images/messenger_api_settings.png diff --git a/demo_apps/messenger_llama_arch.jpg b/docs/images/messenger_llama_arch.jpg similarity index 100% rename from demo_apps/messenger_llama_arch.jpg rename to docs/images/messenger_llama_arch.jpg diff --git a/demo_apps/whatsapp_dashboard.jpg b/docs/images/whatsapp_dashboard.jpg similarity index 100% rename from demo_apps/whatsapp_dashboard.jpg rename to docs/images/whatsapp_dashboard.jpg diff --git a/demo_apps/whatsapp_llama_arch.jpg b/docs/images/whatsapp_llama_arch.jpg similarity index 100% rename from demo_apps/whatsapp_llama_arch.jpg rename to docs/images/whatsapp_llama_arch.jpg diff --git a/docs/inference.md b/docs/inference.md deleted file mode 100644 index 90330bfbf254874df9223b9ab726f03a48295dd2..0000000000000000000000000000000000000000 --- a/docs/inference.md +++ /dev/null @@ -1,168 +0,0 @@ -# Inference - -For inference we have provided an [inference script](../examples/inference.py). Depending on the type of finetuning performed during training the [inference script](../examples/inference.py) takes different arguments. -To finetune all model parameters the output dir of the training has to be given as --model_name argument. -In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument. -Additionally, a prompt for the model in the form of a text file has to be provided. 
The prompt file can either be piped through standard input or given as --prompt_file parameter. - -**Content Safety** -The inference script also supports safety checks for both user prompt and model outputs. In particular, we use two packages, [AuditNLG](https://github.com/salesforce/AuditNLG/tree/main) and [Azure content safety](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/). - -**Note** -If using Azure content Safety, please make sure to get the endpoint and API key as described [here](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/) and add them as the following environment variables,`CONTENT_SAFETY_ENDPOINT` and `CONTENT_SAFETY_KEY`. - -Examples: - - ```bash -# Full finetuning of all parameters -cat <test_prompt_file> | python examples/inference.py --model_name <training_config.output_dir> --use_auditnlg -# PEFT method -cat <test_prompt_file> | python examples/inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg -# prompt as parameter -python examples/inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg - ``` -The example folder contains test prompts for summarization use-case: -``` -examples/samsum_prompt.txt -... -``` - -**Note** -Currently pad token by default in [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). We add the padding token as a special token to the tokenizer, which in this case requires to resize the token_embeddings as shown below: - -```python -tokenizer.add_special_tokens( - { - - "pad_token": "<PAD>", - } - ) -model.resize_token_embeddings(model.config.vocab_size + 1) -``` -Padding would be required for batch inference. In this this [example](../examples/inference.py), batch size = 1 so essentially padding is not required. However,We added the code pointer as an example in case of batch inference. - -### Chat completion -The inference folder also includes a chat completion example, that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example: - -```bash -python examples/chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file examples/chat_completion/chats.json --quantization --use_auditnlg - -``` -### Code Llama - -Code llama was recently released with three flavors, base-model that support multiple programming languages, Python fine-tuned model and an instruction fine-tuned and aligned variation of Code Llama, please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and 34B models are not trained on infilling objective, hence can not be used for infilling use-case. - -Find the scripts to run Code Llama [here](../examples/code_llama/), where there are two examples of running code completion and infilling. - -**Note** Please find the right model on HF side [here](https://huggingface.co/codellama). 
- -Make sure to install Transformers from source for now - -```bash - -pip install git+https://github.com/huggingface/transformers - -``` - -To run the code completion example: - -```bash - -python examples/code_llama/code_completion_example.py --model_name MODEL_NAME --prompt_file examples/code_llama/code_completion_prompt.txt --temperature 0.2 --top_p 0.9 - -``` - -To run the code infilling example: - -```bash - -python examples/code_llama/code_infilling_example.py --model_name MODEL_NAME --prompt_file examples/code_llama/code_infilling_prompt.txt --temperature 0.2 --top_p 0.9 - -``` -To run the 70B Instruct model example run the following (you'll need to enter the system and user prompts to instruct the model): - -```bash - -python examples/code_llama/code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9 - -``` -You can learn more about the chat prompt template [on HF](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/facebookresearch/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example. - -### Llama Guard - -Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard). - -Find the inference script for Llama Guard [here](../examples/llama_guard/). - -**Note** Please find the right model on HF side [here](https://huggingface.co/meta-llama/LlamaGuard-7b). - -Edit [inference.py](../examples/llama_guard/inference.py) to add test prompts for Llama Guard and execute it with this command: - -`python examples/llama_guard/inference.py` - -## Flash Attention and Xformer Memory Efficient Kernels - -Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up inference when used for batched inputs. This has been enabled in `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/). - -```bash -python examples/chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file examples/chat_completion/chats.json --quantization --use_auditnlg --use_fast_kernels - -python examples/inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels - -``` - -## Loading back FSDP checkpoints - -In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above. 
-**To convert the checkpoint use the following command**: - -This is helpful if you have fine-tuned you model using FSDP only as follows: - -```bash -torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 -``` -Then convert your FSDP checkpoint to HuggingFace checkpoints using: -```bash - python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name - - # --HF_model_path_or_name specifies the HF Llama model name or path where it has config.json and tokenizer.json - ``` -By default, training parameter are saved in `train_params.yaml` in the path where FSDP checkpoints are saved, in the converter script we frist try to find the HugingFace model name used in the fine-tuning to load the model with configs from there, if not found user need to provide it. - -Then run inference using: - -```bash -python examples/inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> - -``` - -## Prompt Llama 2 - -As outlined by [this blog by Hugging Face](https://huggingface.co/blog/llama2#how-to-prompt-llama-2), you can use the template below to prompt Llama 2 chat models. Review the [blog article](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) for more information. - -``` -<s>[INST] <<SYS>> -{{ system_prompt }} -<</SYS>> - -{{ user_message }} [/INST] - -``` - -## Other Inference Options - -Alternate inference options include: - -[**vLLM**](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html): -To use vLLM you will need to install it using the instructions [here](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#installation). -Once installed, you can use the vllm/inference.py script provided [here](../examples/vllm/inference.py). - -Below is an example of how to run the vLLM_inference.py script found within the inference folder. - -``` bash -python examples/vllm/inference.py --model_name <PATH/TO/MODEL/7B> -``` - -[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../examples/hf_text_generation_inference/README.md). - -[Here](../demo_apps/llama-on-prem.md) is a complete tutorial on how to use vLLM and TGI to deploy Llama 2 on-prem and interact with the Llama API services. diff --git a/docs/multi_gpu.md b/docs/multi_gpu.md index 0e961bb3e90a3b61bf3acaabdc88ebabbe592440..954fae5c0f50fcec089c22ed7a85fe601938e332 100644 --- a/docs/multi_gpu.md +++ b/docs/multi_gpu.md @@ -9,7 +9,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages: Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node. ## Requirements -To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`examples/finetuning.py`](../examples/finetuning.py) script with torchrun (See [README.md](../README.md) for details). 
+To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details). **Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.** diff --git a/examples/README.md b/examples/README.md deleted file mode 100644 index 2c544d96ace886f834a9a3aca3d768e6672029b6..0000000000000000000000000000000000000000 --- a/examples/README.md +++ /dev/null @@ -1,41 +0,0 @@ -# Examples - -This folder contains finetuning and inference examples for Llama 2, Code Llama and (Purple Llama](https://ai.meta.com/llama/purple-llama/). For the full documentation on these examples please refer to [docs/inference.md](../docs/inference.md) - -## Finetuning - -Please refer to the main [README.md](../README.md) for information on how to use the [finetuning.py](./finetuning.py) script. -After installing the llama-recipes package through [pip](../README.md#installation) you can also invoke the finetuning in two ways: -``` -python -m llama_recipes.finetuning <parameters> - -python examples/finetuning.py <parameters> -``` -Please see [README.md](../README.md) for details. - -## Inference -So far, we have provide the following inference examples: - -1. [inference script](./inference.py) script provides support for Hugging Face accelerate, PEFT and FSDP fine tuned models. It also demonstrates safety features to protect the user from toxic or harmful content. - -2. [vllm/inference.py](./vllm/inference.py) script takes advantage of vLLM's paged attention concept for low latency. - -3. The [hf_text_generation_inference](./hf_text_generation_inference/README.md) folder contains information on Hugging Face Text Generation Inference (TGI). - -4. A [chat completion](./chat_completion/chat_completion.py) example highlighting the handling of chat dialogs. - -5. [Code Llama](./code_llama/) folder which provides examples for [code completion](./code_llama/code_completion_example.py), [code infilling](./code_llama/code_infilling_example.py) and [Llama2 70B code instruct](./code_llama/code_instruct_example.py). - -6. The [Purple Llama Using Anyscale](./Purple_Llama_Anyscale.ipynb) and the [Purple Llama Using OctoAI](./Purple_Llama_OctoAI.ipynb) are notebooks that shows how to use Llama Guard model on Anyscale and OctoAI to classify user inputs as safe or unsafe. - -7. [Llama Guard](./llama_guard/) inference example and [safety_checker](../src/llama_recipes/inference/safety_utils.py) for the main [inference](./inference.py) script. The standalone scripts allows to test Llama Guard on user input, or user input and agent response pairs. The safety_checker integration providers a way to integrate Llama Guard on all inference executions, both for the user input and model output. - -For more in depth information on inference including inference safety checks and examples, see the inference documentation [here](../docs/inference.md). - -**Note** The [sensitive topics safety checker](../src/llama_recipes/inference/safety_utils.py) utilizes AuditNLG which is an optional dependency. Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details. - -**Note** The **vLLM** example requires additional dependencies. 
Please refer to installation section of the main [README.md](../README.md#install-with-optional-dependencies) for details. - -## Train on custom dataset -To show how to train a model on a custom dataset we provide an example to generate a custom dataset in [custom_dataset.py](./custom_dataset.py). -The usage of the custom dataset is further described in the datasets [README](../docs/Dataset.md#training-on-custom-data). diff --git a/recipes/README.md b/recipes/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c3d91d19c8f51fbbb187dc4f1cfb6cd53c740d47 --- /dev/null +++ b/recipes/README.md @@ -0,0 +1,23 @@ +This folder contains examples organized by topic: + +| Subfolder | Description | +|---|---| +[quickstart](./quickstart) | The "Hello World" of using Llama2, start here if you are new to using Llama2. +[finetuning](./finetuning)|Scripts to finetune Llama2 on single-GPU and multi-GPU setups +[inference](./inference)|Scripts to deploy Llama2 for inference locally and using model servers +[use_cases](./use_cases)|Scripts showing common applications of Llama2 +[responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs +[llama_api_providers](./llama_api_providers)|Scripts to run inference on Llama via hosted endpoints +[benchmarks](./benchmarks)|Scripts to benchmark Llama 2 models inference on various backends +[code_llama](./code_llama)|Scripts to run inference with the Code Llama models +[evaluation](./evaluation)|Scripts to evaluate fine-tuned Llama2 models using `lm-evaluation-harness` from `EleutherAI` + + +**<a id="replicate_note">Note on using Replicate</a>** +To run some of the demo apps here, you'll need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 locally on Mac" above or the "Running Llama2 in Google Colab" below. + +**<a id="octoai_note">Note on using OctoAI</a>** +You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](./llama_api_providers/OctoAI_API_examples/). You can sign into [OctoAI](https://octoai.cloud) with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences). + +### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing) +To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google drive. 
Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on M1 MBP takes about 20 seconds. \ No newline at end of file diff --git a/benchmarks/inference/README.md b/recipes/benchmarks/inference/README.md similarity index 100% rename from benchmarks/inference/README.md rename to recipes/benchmarks/inference/README.md diff --git a/benchmarks/inference/on-prem/README.md b/recipes/benchmarks/inference/on-prem/README.md similarity index 100% rename from benchmarks/inference/on-prem/README.md rename to recipes/benchmarks/inference/on-prem/README.md diff --git a/benchmarks/inference/on-prem/vllm/chat_vllm_benchmark.py b/recipes/benchmarks/inference/on-prem/vllm/chat_vllm_benchmark.py similarity index 100% rename from benchmarks/inference/on-prem/vllm/chat_vllm_benchmark.py rename to recipes/benchmarks/inference/on-prem/vllm/chat_vllm_benchmark.py diff --git a/benchmarks/inference/on-prem/vllm/input.jsonl b/recipes/benchmarks/inference/on-prem/vllm/input.jsonl similarity index 100% rename from benchmarks/inference/on-prem/vllm/input.jsonl rename to recipes/benchmarks/inference/on-prem/vllm/input.jsonl diff --git a/benchmarks/inference/on-prem/vllm/parameters.json b/recipes/benchmarks/inference/on-prem/vllm/parameters.json similarity index 100% rename from benchmarks/inference/on-prem/vllm/parameters.json rename to recipes/benchmarks/inference/on-prem/vllm/parameters.json diff --git a/benchmarks/inference/on-prem/vllm/pretrained_vllm_benchmark.py b/recipes/benchmarks/inference/on-prem/vllm/pretrained_vllm_benchmark.py similarity index 100% rename from benchmarks/inference/on-prem/vllm/pretrained_vllm_benchmark.py rename to recipes/benchmarks/inference/on-prem/vllm/pretrained_vllm_benchmark.py diff --git a/benchmarks/inference/tokenizer/special_tokens_map.json b/recipes/benchmarks/inference/tokenizer/special_tokens_map.json similarity index 100% rename from benchmarks/inference/tokenizer/special_tokens_map.json rename to recipes/benchmarks/inference/tokenizer/special_tokens_map.json diff --git a/benchmarks/inference/tokenizer/tokenizer.json b/recipes/benchmarks/inference/tokenizer/tokenizer.json similarity index 100% rename from benchmarks/inference/tokenizer/tokenizer.json rename to recipes/benchmarks/inference/tokenizer/tokenizer.json diff --git a/benchmarks/inference/tokenizer/tokenizer.model b/recipes/benchmarks/inference/tokenizer/tokenizer.model similarity index 100% rename from benchmarks/inference/tokenizer/tokenizer.model rename to recipes/benchmarks/inference/tokenizer/tokenizer.model diff --git a/benchmarks/inference/tokenizer/tokenizer_config.json b/recipes/benchmarks/inference/tokenizer/tokenizer_config.json similarity index 100% rename from benchmarks/inference/tokenizer/tokenizer_config.json rename to recipes/benchmarks/inference/tokenizer/tokenizer_config.json diff --git a/benchmarks/inference_throughput/cloud-api/README.md b/recipes/benchmarks/inference_throughput/cloud-api/README.md similarity index 100% rename from benchmarks/inference_throughput/cloud-api/README.md rename to recipes/benchmarks/inference_throughput/cloud-api/README.md diff --git a/benchmarks/inference_throughput/cloud-api/azure/chat_azure_api_benchmark.py b/recipes/benchmarks/inference_throughput/cloud-api/azure/chat_azure_api_benchmark.py similarity index 100% rename from benchmarks/inference_throughput/cloud-api/azure/chat_azure_api_benchmark.py rename to recipes/benchmarks/inference_throughput/cloud-api/azure/chat_azure_api_benchmark.py diff 
--git a/benchmarks/inference_throughput/cloud-api/azure/input.jsonl b/recipes/benchmarks/inference_throughput/cloud-api/azure/input.jsonl similarity index 100% rename from benchmarks/inference_throughput/cloud-api/azure/input.jsonl rename to recipes/benchmarks/inference_throughput/cloud-api/azure/input.jsonl diff --git a/benchmarks/inference_throughput/cloud-api/azure/parameters.json b/recipes/benchmarks/inference_throughput/cloud-api/azure/parameters.json similarity index 100% rename from benchmarks/inference_throughput/cloud-api/azure/parameters.json rename to recipes/benchmarks/inference_throughput/cloud-api/azure/parameters.json diff --git a/benchmarks/inference_throughput/cloud-api/azure/pretrained_azure_api_benchmark.py b/recipes/benchmarks/inference_throughput/cloud-api/azure/pretrained_azure_api_benchmark.py similarity index 100% rename from benchmarks/inference_throughput/cloud-api/azure/pretrained_azure_api_benchmark.py rename to recipes/benchmarks/inference_throughput/cloud-api/azure/pretrained_azure_api_benchmark.py diff --git a/benchmarks/inference_throughput/requirements.txt b/recipes/benchmarks/inference_throughput/requirements.txt similarity index 100% rename from benchmarks/inference_throughput/requirements.txt rename to recipes/benchmarks/inference_throughput/requirements.txt diff --git a/recipes/code_llama/README.md b/recipes/code_llama/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d5f4bda52e2576d9e6fc33d049129d7c1ec0e54d --- /dev/null +++ b/recipes/code_llama/README.md @@ -0,0 +1,39 @@ +# Code Llama + +Code llama was recently released with three flavors, base-model that support multiple programming languages, Python fine-tuned model and an instruction fine-tuned and aligned variation of Code Llama, please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and 34B models are not trained on infilling objective, hence can not be used for infilling use-case. + +Find the scripts to run Code Llama, where there are two examples of running code completion and infilling. + +**Note** Please find the right model on HF side [here](https://huggingface.co/codellama). + +Make sure to install Transformers from source for now + +```bash + +pip install git+https://github.com/huggingface/transformers + +``` + +To run the code completion example: + +```bash + +python code_completion_example.py --model_name MODEL_NAME --prompt_file code_completion_prompt.txt --temperature 0.2 --top_p 0.9 + +``` + +To run the code infilling example: + +```bash + +python code_infilling_example.py --model_name MODEL_NAME --prompt_file code_infilling_prompt.txt --temperature 0.2 --top_p 0.9 + +``` +To run the 70B Instruct model example run the following (you'll need to enter the system and user prompts to instruct the model): + +```bash + +python code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9 + +``` +You can learn more about the chat prompt template [on HF](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/facebookresearch/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example. 
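
For reference, below is a minimal, illustrative sketch of building a chat-formatted prompt with the tokenizer's built-in chat template via `apply_chat_template` (available in recent `transformers` releases). The system and user messages are placeholders, not prompts shipped with this repo.

```python
# Illustrative sketch: format a conversation for CodeLlama-70b-Instruct using the
# chat template bundled with the Hugging Face tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-70b-Instruct-hf")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},    # placeholder system prompt
    {"role": "user", "content": "Write a function that reverses a string."}, # placeholder user prompt
]

# Returns the fully formatted prompt string; pass tokenize=True to get token ids instead.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```

The resulting prompt can then be tokenized and passed to `model.generate()` as usual.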
diff --git a/examples/code_llama/code_completion_example.py b/recipes/code_llama/code_completion_example.py similarity index 100% rename from examples/code_llama/code_completion_example.py rename to recipes/code_llama/code_completion_example.py diff --git a/examples/code_llama/code_completion_prompt.txt b/recipes/code_llama/code_completion_prompt.txt similarity index 100% rename from examples/code_llama/code_completion_prompt.txt rename to recipes/code_llama/code_completion_prompt.txt diff --git a/examples/code_llama/code_infilling_example.py b/recipes/code_llama/code_infilling_example.py similarity index 100% rename from examples/code_llama/code_infilling_example.py rename to recipes/code_llama/code_infilling_example.py diff --git a/examples/code_llama/code_infilling_prompt.txt b/recipes/code_llama/code_infilling_prompt.txt similarity index 100% rename from examples/code_llama/code_infilling_prompt.txt rename to recipes/code_llama/code_infilling_prompt.txt diff --git a/examples/code_llama/code_instruct_example.py b/recipes/code_llama/code_instruct_example.py similarity index 100% rename from examples/code_llama/code_instruct_example.py rename to recipes/code_llama/code_instruct_example.py diff --git a/eval/README.md b/recipes/evaluation/README.md similarity index 100% rename from eval/README.md rename to recipes/evaluation/README.md diff --git a/eval/eval.py b/recipes/evaluation/eval.py similarity index 100% rename from eval/eval.py rename to recipes/evaluation/eval.py diff --git a/eval/open_llm_eval_prep.sh b/recipes/evaluation/open_llm_eval_prep.sh similarity index 100% rename from eval/open_llm_eval_prep.sh rename to recipes/evaluation/open_llm_eval_prep.sh diff --git a/eval/open_llm_leaderboard/arc_challeneg_25shots.yaml b/recipes/evaluation/open_llm_leaderboard/arc_challeneg_25shots.yaml similarity index 100% rename from eval/open_llm_leaderboard/arc_challeneg_25shots.yaml rename to recipes/evaluation/open_llm_leaderboard/arc_challeneg_25shots.yaml diff --git a/eval/open_llm_leaderboard/hellaswag_10shots.yaml b/recipes/evaluation/open_llm_leaderboard/hellaswag_10shots.yaml similarity index 100% rename from eval/open_llm_leaderboard/hellaswag_10shots.yaml rename to recipes/evaluation/open_llm_leaderboard/hellaswag_10shots.yaml diff --git a/eval/open_llm_leaderboard/hellaswag_utils.py b/recipes/evaluation/open_llm_leaderboard/hellaswag_utils.py similarity index 100% rename from eval/open_llm_leaderboard/hellaswag_utils.py rename to recipes/evaluation/open_llm_leaderboard/hellaswag_utils.py diff --git a/eval/open_llm_leaderboard/mmlu_5shots.yaml b/recipes/evaluation/open_llm_leaderboard/mmlu_5shots.yaml similarity index 100% rename from eval/open_llm_leaderboard/mmlu_5shots.yaml rename to recipes/evaluation/open_llm_leaderboard/mmlu_5shots.yaml diff --git a/eval/open_llm_leaderboard/winogrande_5shots.yaml b/recipes/evaluation/open_llm_leaderboard/winogrande_5shots.yaml similarity index 100% rename from eval/open_llm_leaderboard/winogrande_5shots.yaml rename to recipes/evaluation/open_llm_leaderboard/winogrande_5shots.yaml diff --git a/recipes/finetuning/LLM_finetuning_overview.md b/recipes/finetuning/LLM_finetuning_overview.md new file mode 100644 index 0000000000000000000000000000000000000000..ec4ef06e06c19879ea6de382cd38eaca77cf0720 --- /dev/null +++ b/recipes/finetuning/LLM_finetuning_overview.md @@ -0,0 +1,66 @@ +## LLM Fine-Tuning + +Here we discuss fine-tuning Llama 2 with a couple of different recipes. We will cover two scenarios here: + + +## 1. 
**Parameter Efficient Model Fine-Tuning** + This helps make the fine-tuning process more affordable, even on a single consumer-grade GPU. These methods keep the whole model frozen and only add tiny learnable parameters/layers to the model, so we train only a very small portion of the parameters. The best-known methods in this category are [LoRA](https://arxiv.org/pdf/2106.09685.pdf), Llama Adapter and Prefix-tuning. + + +These methods address three aspects: + + +- **Cost of full fine-tuning** – these methods only train a small set of extra parameters instead of the full model, which makes it possible to run them on consumer GPUs. + +- **Cost of deployment** – normally we would need to deploy a separate model for each fine-tuned downstream task; with these methods, only a small set of parameters (a few MB instead of several GB) on top of the pretrained model is needed per task. The pretrained model can be treated as a shared backbone and these extra parameters as task-specific heads. + +- **Catastrophic forgetting** — these methods also help mitigate the forgetting of the original task that can happen during fine-tuning. + +The HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods, and we make use of it here. Please read more [here](https://huggingface.co/blog/peft). + + + +## 2. **Full/ Partial Parameter Fine-Tuning** + +Full parameter fine-tuning has its own advantages; within this approach there are multiple strategies that can help: + +- Keep the pretrained model frozen and only fine-tune the task head, for example the classifier. + + +- Keep the pretrained model frozen and add a few fully connected layers on top. + + +- Fine-tune all the layers. + +You can also keep most of the layers frozen and only fine-tune a few. There are many different techniques for choosing which layers to freeze or unfreeze based on different criteria. + +<div style="display: flex;"> + <img src="../../docs/images/feature-based_FN.png" alt="Image 1" width="250" /> + <img src="../../docs/images/feature-based_FN_2.png" alt="Image 2" width="250" /> + <img src="../../docs/images/full-param-FN.png" alt="Image 3" width="250" /> +</div> + + + +In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training; full fine-tuning of Llama 2 7B, for example, won't fit into a single GPU. +The way to think about it is that you need enough GPU memory to hold the model parameters, gradients and optimizer states, each of which takes roughly parameter count x bytes per value (fp32 = 4 bytes, fp16 = 2 bytes, bf16 = 2 bytes). +For example, the AdamW optimizer keeps two extra states for each parameter, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training/unfreezing, your GPU memory needs can grow beyond one GPU. + +**FSDP (Fully Sharded Data Parallel)** + + +PyTorch provides the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Before FSDP there was DDP (Distributed Data Parallel), where each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass the gradients are synced across GPUs.
+ +FSDP extends this idea by sharding not only the data but also the model parameters, gradients and optimizer states. This means each GPU keeps only one shard of the model, which results in huge memory savings and enables us to fit a much larger model onto the same number of GPUs. As an example, with DDP the most you could fit into a GPU with 16GB memory is a model of around 700M parameters; even with 4 such GPUs you still can't scale beyond the model size that fits into one GPU. With FSDP, however, you can fit a 3B model onto 4 such GPUs, a more than 4x larger model. + + +Please read more on FSDP [here](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) & get started with FSDP [here](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html). + + +To boost the performance of fine-tuning with FSDP, we can make use of a number of features such as: + +- **Mixed Precision** which in FSDP is much more flexible compared to Autocast. It gives the user control over setting the precision for model parameters, buffers and gradients. + +- **Activation Checkpointing** which is a technique to save memory by discarding the intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard aware, meaning we need to apply it after wrapping the model with FSDP. Our script makes use of this. + +- **auto_wrap_policy** which is the way to specify how FSDP should partition the model; there is default support for a transformer wrapping policy. This allows FSDP to form each FSDP unit (partition of the model) based on the transformer class in the model. To identify this layer in the model, you need to look at the layer that wraps both the attention layer and the MLP. This helps FSDP form more fine-grained units for communication, which helps optimize the communication cost. diff --git a/recipes/finetuning/README.md b/recipes/finetuning/README.md new file mode 100644 index 0000000000000000000000000000000000000000..07219cf711fc86d076ac8f0af7f4a0da2588def0 --- /dev/null +++ b/recipes/finetuning/README.md @@ -0,0 +1,90 @@ +# Finetuning Llama + +This folder contains instructions to fine-tune Llama 2 on a +* [single-GPU setup](./singlegpu_finetuning.md) +* [multi-GPU setup](./multigpu_finetuning.md) + +using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package. + +If you are new to fine-tuning techniques, check out this [overview](./LLM_finetuning_overview.md). + +> [!TIP] +> If you want to try finetuning Llama 2 with Hugging Face's trainer, here is a Jupyter notebook with an [example](./huggingface_trainer/peft_finetuning.ipynb). + + +## How to configure finetuning settings? + +> [!TIP] +> All the settings defined in the [config files](../../src/llama_recipes/configs/) can be passed as CLI args when running the script; there is no need to change the config files directly. + + +* [Training config file](../../src/llama_recipes/configs/training.py) is the main config file that specifies the settings for our run and can be found in the [configs folder](../../src/llama_recipes/configs/). + +It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on.
Below is the list of supported settings: + +```python + +model_name: str="PATH/to/LLAMA 2/7B" +enable_fsdp: bool= False +run_validation: bool=True +batch_size_training: int=4 +gradient_accumulation_steps: int=1 +num_epochs: int=3 +num_workers_dataloader: int=2 +lr: float=2e-4 +weight_decay: float=0.0 +gamma: float= 0.85 +use_fp16: bool=False +mixed_precision: bool=True +val_batch_size: int=4 +dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset +peft_method: str = "lora" # None , llama_adapter, prefix +use_peft: bool=False +output_dir: str = "./ft-output" +freeze_layers: bool = False +num_freeze_layers: int = 1 +quantization: bool = False +save_model: bool = False +dist_checkpoint_root_folder: str="model_checkpoints" +dist_checkpoint_folder: str="fine-tuned" +save_optimizer: bool=False + +``` + +* [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets. + +* [peft config file](../../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and the respective settings that can be modified. + +* [FSDP config file](../../src/llama_recipes/configs/fsdp.py) provides FSDP settings such as: + + * `mixed_precision` boolean flag to specify using mixed precision, defaults to True. + + * `use_fp16` boolean flag to specify using FP16 for mixed precision, defaults to False. We recommend not setting this flag and only setting `mixed_precision`, which will use `BF16`; this helps with speed and memory savings while avoiding the loss-scaling challenges of `FP16`. + + * `sharding_strategy` specifies the sharding strategy for FSDP; it can be: + * `FULL_SHARD` shards model parameters, gradients and optimizer states, and results in the most memory savings. + + * `SHARD_GRAD_OP` shards gradients and optimizer states and keeps the parameters after the first `all_gather`. This reduces communication overhead, especially on slower networks, and is most beneficial in multi-node cases. It comes with the trade-off of higher memory consumption. + + * `NO_SHARD` is equivalent to DDP; it does not shard model parameters, gradients or optimizer states and keeps the full parameters after the first `all_gather`. + + * `HYBRID_SHARD` is available on PyTorch nightlies. It applies FSDP within a node and DDP between nodes. It's intended for multi-node cases and is helpful on slower networks, provided your model fits into one node. + +* `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams the state_dict of each model shard from each rank to CPU and assembles the full state_dict on CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank and enables re-loading the model with a different world size. + +* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory at the cost of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend using this option. + +* `pure_bf16` moves the model to `BFloat16`, and if `optimizer` is set to `anyprecision` then the optimizer states will be kept in `BFloat16` as well. You can use this option if necessary. + + +## Weights & Biases Experiment Tracking + +You can enable [W&B](https://wandb.ai/) experiment tracking by using the `use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`.
+ +```bash +python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model --use_wandb +``` +You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below. +<div style="display: flex;"> + <img src="../../docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" /> +</div> diff --git a/docs/Dataset.md b/recipes/finetuning/datasets/README.md similarity index 86% rename from docs/Dataset.md rename to recipes/finetuning/datasets/README.md index f2819862a256c609b7c7905cfdb291f57aa22b06..ea2847f73bece6ab48ef048a3ec450ea4fecbff1 100644 --- a/docs/Dataset.md +++ b/recipes/finetuning/datasets/README.md @@ -1,6 +1,6 @@ # Datasets and Evaluation Metrics -The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or `examples/finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](../examples/custom_dataset.py) Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses) +The provided fine tuning scripts allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or [`recipes/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py) Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses) * [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of english sentences and possible corrections. * [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`. @@ -32,7 +32,7 @@ To supply a custom dataset you need to provide a single .py file which contains ```@python def get_custom_dataset(dataset_config, tokenizer, split: str): ``` -For an example `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](../examples/custom_dataset.py). +For an example `get_custom_dataset` you can look at the provided datasets in llama_recipes.datasets or [examples/custom_dataset.py](custom_dataset.py). The `dataset_config` in the above signature will be an instance of llama_recipes.configs.dataset.custom_dataset with the modifications made through the command line. The split signals wether to return the training or validation dataset. The default function name is `get_custom_dataset` but this can be changed as described below. @@ -48,17 +48,17 @@ python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.f This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset. ### Adding new dataset -Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc. 
+Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc. -Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder. +Additionally, there is a preprocessing function for each dataset in the [datasets](../../../src/llama_recipes/datasets) folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling ```model(**data)```. For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields. To add a custom dataset the following steps need to be performed. -1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../src/llama_recipes/configs/datasets.py). +1. Create a dataset configuration after the schema described above. Examples can be found in [configs/datasets.py](../../../src/llama_recipes/configs/datasets.py). 2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for train/validation split as defined in the dataclass. -3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../src/llama_recipes/utils/dataset_utils.py) +3. Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in [utils/dataset_utils.py](../../../src/llama_recipes/utils/dataset_utils.py) 4. Set dataset field in training config to dataset name or use --dataset option of the `llama_recipes.finetuning` module or examples/finetuning.py training script. 
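To make steps 2 and 3 above concrete, here is a minimal sketch of such a preprocessing routine (the module name, data file and field names are illustrative assumptions; register the function as described in step 3):

```python
# my_text_dataset.py -- hypothetical preprocessing routine for a new dataset.
# Assumes a local JSON lines file with a "text" field; adapt the loading and formatting to your data.
import datasets


def get_my_text_dataset(dataset_config, tokenizer, split: str):
    raw = datasets.load_dataset("json", data_files="my_data.jsonl", split="train")
    parts = raw.train_test_split(test_size=0.1, seed=42)
    data = parts["train"] if split == "train" else parts["test"]

    def tokenize(sample):
        ids = tokenizer(sample["text"]).input_ids + [tokenizer.eos_token_id]
        # CausalLM fine-tuning expects input_ids, attention_mask and labels.
        return {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": ids.copy()}

    return data.map(tokenize, remove_columns=list(data.features))
```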
## Application diff --git a/examples/custom_dataset.py b/recipes/finetuning/datasets/custom_dataset.py similarity index 100% rename from examples/custom_dataset.py rename to recipes/finetuning/datasets/custom_dataset.py diff --git a/examples/finetuning.py b/recipes/finetuning/finetuning.py similarity index 100% rename from examples/finetuning.py rename to recipes/finetuning/finetuning.py diff --git a/examples/quickstart.ipynb b/recipes/finetuning/huggingface_trainer/peft_finetuning.ipynb similarity index 100% rename from examples/quickstart.ipynb rename to recipes/finetuning/huggingface_trainer/peft_finetuning.ipynb diff --git a/examples/multi_node.slurm b/recipes/finetuning/multi_node.slurm similarity index 90% rename from examples/multi_node.slurm rename to recipes/finetuning/multi_node.slurm index cbdfedb00eeeca345bfbf4bc280ae6bdb9224a7b..e8aba3f4cc8df849525f3f7af3f4ccc723a81816 100644 --- a/examples/multi_node.slurm +++ b/recipes/finetuning/multi_node.slurm @@ -32,5 +32,5 @@ export CUDA_LAUNCH_BLOCKING=0 export NCCL_SOCKET_IFNAME="ens" export FI_EFA_USE_DEVICE_RDMA=1 -srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 examples/finetuning.py --enable_fsdp --use_peft --peft_method lora +srun torchrun --nproc_per_node 4 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora diff --git a/recipes/finetuning/multigpu_finetuning.md b/recipes/finetuning/multigpu_finetuning.md new file mode 100644 index 0000000000000000000000000000000000000000..f938ac71cccc46b34aee296cfabaaf315a27c529 --- /dev/null +++ b/recipes/finetuning/multigpu_finetuning.md @@ -0,0 +1,111 @@ +# Fine-tuning with Multi GPU +This recipe steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single or across multiple nodes. + + +## Requirements +Ensure that you have installed the llama-recipes package ([details](../../README.md#installing)). + +We will also need 2 packages: +1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning. +2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning). + +> [!NOTE] +> The llama-recipes package will install PyTorch 2.0.1 version. In case you want to use FSDP with PEFT for multi GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies)) +> +> INT8 quantization is not currently supported in FSDP + + +## How to run it +Get access to a machine with multiple GPUs (in this case we tested with 4 A100 and A10s). + +### With FSDP + PEFT + +<details open> +<summary>Single-node Multi-GPU</summary> + + torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model + +</details> + +<details> +<summary>Multi-node Multi-GPU</summary> +Here we use a slurm script to schedule a job with slurm over multiple nodes. + + # Change the num nodes and GPU per nodes in the script before running. + sbatch ./multi_node.slurm + +</details> + + +We use `torchrun` to spawn multiple processes for FSDP. 
+ +The args used in the command above are: +* `--enable_fsdp` boolean flag to enable FSDP in the script +* `--use_peft` boolean flag to enable PEFT methods in the script +* `--peft_method` to specify the PEFT method, here we use `lora` other options are `llama_adapter`, `prefix`. + + +### With only FSDP +If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change the `nproc_per_node` to your available GPUs. This has been tested with `BF16` on 8xA100, 40GB GPUs. + +```bash +torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels +``` + +### Using less CPU memory (FSDP on 70B model) + +If you are running full parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode as the following command. This option will load model on rank0 only before moving model to devices to construct FSDP. This can dramatically save cpu memory when loading large models like 70B (on a 8-gpu node, this reduces cpu memory from 2+T to 280G for 70B model). This has been tested with `BF16` on 16xA100, 80GB GPUs. + +```bash +torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned +``` + + + +## Running with different datasets +Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)). + +* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking. + +* `alpaca_dataset` : to get this open source data please download the `aplaca.json` to `dataset` folder. + +```bash +wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json +``` + +* `samsum_dataset` + +To run with each of the datasets set the `dataset` flag in the command as shown below: + +```bash +# grammer_dataset +torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model + +# alpaca_dataset + +torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model + + +# samsum_dataset + +torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model + +``` + + + +## [TIP] Slow interconnect between nodes? +In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag. 
+ +HSDP (Hybrid sharding Data Parallel) helps to define a hybrid sharding strategy where you can have FSDP within `sharding_group_size` which can be the minimum number of GPUs you can fit your model and DDP between the replicas of the model specified by `replica_group_size`. + +This will require to set the Sharding strategy in [fsdp config](../../src/llama_recipes/configs/fsdp.py) to `ShardingStrategy.HYBRID_SHARD` and specify two additional settings, `sharding_group_size` and `replica_group_size` where former specifies the sharding group size, number of GPUs that you model can fit into to form a replica of a model and latter specifies the replica group size, which is world_size/sharding_group_size. + +```bash + +torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n + +``` + + + diff --git a/recipes/finetuning/singlegpu_finetuning.md b/recipes/finetuning/singlegpu_finetuning.md new file mode 100644 index 0000000000000000000000000000000000000000..3e9eea433d8295fca2820071b9efa495b41db625 --- /dev/null +++ b/recipes/finetuning/singlegpu_finetuning.md @@ -0,0 +1,62 @@ +# Fine-tuning with Single GPU +This recipe steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on a single GPU. + +These are the instructions for using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package. + + +## Requirements + +Ensure that you have installed the llama-recipes package ([details](../../../README.md#installing)). + +To run fine-tuning on a single GPU, we will make use of two packages: +1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning. +2. [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) for int8 quantization. + + +## How to run it? + +```bash +python -m finetuning.py --use_peft --peft_method lora --quantization --use_fp16 --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model +``` +The args used in the command above are: + +* `--use_peft` boolean flag to enable PEFT methods in the script +* `--peft_method` to specify the PEFT method, here we use `lora` other options are `llama_adapter`, `prefix`. +* `--quantization` boolean flag to enable int8 quantization + +> [!NOTE] +> In case you are using a multi-GPU machine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`. + + +### How to run with different datasets? + +Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)). + +* `grammar_dataset` : use this [notebook](../../src/llama_recipes/datasets/grammar_dataset/grammar_dataset_process.ipynb) to pull and process the Jfleg and C4 200M datasets for grammar checking. + +* `alpaca_dataset` : to get this open source data please download the `aplaca.json` to `dataset` folder. 
+ + +```bash +wget -P ../../src/llama_recipes/datasets https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json +``` + +* `samsum_dataset` + +to run with each of the datasets set the `dataset` flag in the command as shown below: + +```bash +# grammer_dataset + +python -m finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model + +# alpaca_dataset + +python -m finetuning.py --use_peft --peft_method lora --quantization --dataset alpaca_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model + + +# samsum_dataset + +python -m finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model + +``` diff --git a/demo_apps/Llama2_Gradio.ipynb b/recipes/inference/llama_web_ui/Llama2_Gradio.ipynb similarity index 94% rename from demo_apps/Llama2_Gradio.ipynb rename to recipes/inference/llama_web_ui/Llama2_Gradio.ipynb index 77b4088cd1bb99c158ba148cfc05f25f3d362e7d..449c2b811478316591159b253528895ec54fe803 100644 --- a/demo_apps/Llama2_Gradio.ipynb +++ b/recipes/inference/llama_web_ui/Llama2_Gradio.ipynb @@ -1,5 +1,15 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "e4532411", + "metadata": {}, + "outputs": [], + "source": [ + "# TODO REFACTOR: Integrate code from _legacy/inference.py into this notebook" + ] + }, { "cell_type": "markdown", "id": "47a9adb3", diff --git a/recipes/inference/llama_web_ui/README.md b/recipes/inference/llama_web_ui/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1651892819e42ba65fd32f983d06cf86b0ca020d --- /dev/null +++ b/recipes/inference/llama_web_ui/README.md @@ -0,0 +1,25 @@ +## Quick Web UI for Llama2 Chat +If you prefer to see Llama2 in action in a web UI, instead of the notebooks above, you can try one of the two methods: + +### Running [Streamlit](https://streamlit.io/) with Llama2 +Open a Terminal, run the following commands: +``` +pip install streamlit langchain replicate +git clone https://github.com/facebookresearch/llama-recipes +cd llama-recipes/llama-demo-apps +``` + +Replace the `<your replicate api token>` in `streamlit_llama2.py` with your API token created [here](https://replicate.com/account/api-tokens) - for more info, see the note [above](#replicate_note). + +Then run the command `streamlit run streamlit_llama2.py` and you'll see on your browser the following UI with question and answer - you can enter new text question, click Submit, and see Llama2's answer: + + + + +### Running [Gradio](https://www.gradio.app/) with Llama2 (using [Replicate](Llama2_Gradio.ipynb) or [OctoAI](../../llama_api_providers/OctoAI_API_examples/Llama2_Gradio.ipynb)) + +To see how to query Llama2 and get answers with the Gradio UI both from the notebook and web, just launch the notebook `Llama2_Gradio.ipynb`. For more info, on how to get set up with a token to power these apps, see the note on [Replicate](../../README.md#replicate_note) and [OctoAI](../../README.md##octoai_note). + +Then enter your question, click Submit. 
You'll see in the notebook or a browser with URL http://127.0.0.1:7860 the following UI: + + diff --git a/recipes/inference/llama_web_ui/requirements.txt b/recipes/inference/llama_web_ui/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..9560deffd3743747356c838bea2a6a9633add84d --- /dev/null +++ b/recipes/inference/llama_web_ui/requirements.txt @@ -0,0 +1,3 @@ +streamlit +langchain +replicate \ No newline at end of file diff --git a/demo_apps/streamlit_llama2.py b/recipes/inference/llama_web_ui/streamlit_llama2.py similarity index 94% rename from demo_apps/streamlit_llama2.py rename to recipes/inference/llama_web_ui/streamlit_llama2.py index b0de27f0cc5e556a6ec36eb1aabbecca3f6b4921..be3ee7ddff820136aaa49082c96bc2d612115a9e 100644 --- a/demo_apps/streamlit_llama2.py +++ b/recipes/inference/llama_web_ui/streamlit_llama2.py @@ -1,6 +1,8 @@ # Copyright (c) Meta Platforms, Inc. and affiliates. # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. +# TODO REFACTOR: Convert this to an ipynb notebook + import streamlit as st from langchain.llms import Replicate import os diff --git a/recipes/inference/local_inference/README.md b/recipes/inference/local_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a943d464e4d46e2c815995894d0bd37e785963ce --- /dev/null +++ b/recipes/inference/local_inference/README.md @@ -0,0 +1,87 @@ +# Local Inference + +For local inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments. +To finetune all model parameters the output dir of the training has to be given as --model_name argument. +In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument. +Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as --prompt_file parameter. + +**Content Safety** +The inference script also supports safety checks for both user prompt and model outputs. In particular, we use two packages, [AuditNLG](https://github.com/salesforce/AuditNLG/tree/main) and [Azure content safety](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/). + +**Note** +If using Azure content Safety, please make sure to get the endpoint and API key as described [here](https://pypi.org/project/azure-ai-contentsafety/1.0.0b1/) and add them as the following environment variables,`CONTENT_SAFETY_ENDPOINT` and `CONTENT_SAFETY_KEY`. + +Examples: + + ```bash +# Full finetuning of all parameters +cat <test_prompt_file> | python inference.py --model_name <training_config.output_dir> --use_auditnlg +# PEFT method +cat <test_prompt_file> | python inference.py --model_name <training_config.model_name> --peft_model <training_config.output_dir> --use_auditnlg +# prompt as parameter +python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg + ``` +The folder contains test prompts for summarization use-case: +``` +samsum_prompt.txt +... +``` + +**Note** +Currently pad token by default in [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). 
We add the padding token as a special token to the tokenizer, which in this case requires resizing the token_embeddings as shown below: + +```python +tokenizer.add_special_tokens( + { + + "pad_token": "<PAD>", + } + ) +model.resize_token_embeddings(model.config.vocab_size + 1) +``` +Padding would be required for batch inference. In this [example](inference.py), batch size = 1, so padding is essentially not required. However, we added the code pointer as an example in case of batch inference. + + +## Chat completion +The inference folder also includes a chat completion example that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example: + +```bash +python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg + +``` + +## Flash Attention and Xformer Memory Efficient Kernels + +Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels based on the hardware being used. This speeds up inference when used for batched inputs. This has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/). + +```bash +python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json --quantization --use_auditnlg --use_fast_kernels + +python inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels + +``` + +## Loading back FSDP checkpoints + +In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../../../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above. +**To convert the checkpoint use the following command**: + +This is helpful if you have fine-tuned your model using FSDP only, as follows: + +```bash +torchrun --nnodes 1 --nproc_per_node 8 recipes/finetuning/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 +``` +Then convert your FSDP checkpoint to HuggingFace checkpoints using: +```bash + python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path PATH/to/FSDP/Checkpoints --consolidated_model_path PATH/to/save/checkpoints --HF_model_path_or_name PATH/or/HF/model_name + + # --HF_model_path_or_name specifies the HF Llama model name or path where it has config.json and tokenizer.json + ``` +By default, training parameters are saved in `train_params.yaml` in the path where FSDP checkpoints are saved. The converter script first tries to find the Hugging Face model name used during fine-tuning and loads the model configs from there; if it is not found, the user needs to provide it.
+ +Then run inference using: + +```bash +python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> + +``` \ No newline at end of file diff --git a/examples/chat_completion/chat_completion.py b/recipes/inference/local_inference/chat_completion/chat_completion.py similarity index 100% rename from examples/chat_completion/chat_completion.py rename to recipes/inference/local_inference/chat_completion/chat_completion.py diff --git a/examples/chat_completion/chats.json b/recipes/inference/local_inference/chat_completion/chats.json similarity index 100% rename from examples/chat_completion/chats.json rename to recipes/inference/local_inference/chat_completion/chats.json diff --git a/examples/inference.py b/recipes/inference/local_inference/inference.py similarity index 100% rename from examples/inference.py rename to recipes/inference/local_inference/inference.py diff --git a/examples/samsum_prompt.txt b/recipes/inference/local_inference/samsum_prompt.txt similarity index 100% rename from examples/samsum_prompt.txt rename to recipes/inference/local_inference/samsum_prompt.txt diff --git a/recipes/inference/model_servers/README.md b/recipes/inference/model_servers/README.md new file mode 100644 index 0000000000000000000000000000000000000000..89581b32f9515d2a17e9c48c7eb0cd9aafe963e4 --- /dev/null +++ b/recipes/inference/model_servers/README.md @@ -0,0 +1,4 @@ +## [Running Llama2 On-Prem with vLLM and TGI](llama-on-prem.md) +This tutorial shows how to use Llama 2 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference) to build Llama 2 on-prem apps. + +\* To run a quantized Llama2 model on iOS and Android, you can use the open source [MLC LLM](https://github.com/mlc-ai/mlc-llm) or [llama.cpp](https://github.com/ggerganov/llama.cpp). You can even make a Linux OS that boots to Llama2 ([repo](https://github.com/trholding/llama2.c)). 
\ No newline at end of file diff --git a/examples/hf_text_generation_inference/README.md b/recipes/inference/model_servers/hf_text_generation_inference/README.md similarity index 100% rename from examples/hf_text_generation_inference/README.md rename to recipes/inference/model_servers/hf_text_generation_inference/README.md diff --git a/examples/hf_text_generation_inference/merge_lora_weights.py b/recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py similarity index 100% rename from examples/hf_text_generation_inference/merge_lora_weights.py rename to recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py diff --git a/demo_apps/llama-on-prem.md b/recipes/inference/model_servers/llama-on-prem.md similarity index 100% rename from demo_apps/llama-on-prem.md rename to recipes/inference/model_servers/llama-on-prem.md diff --git a/examples/vllm/inference.py b/recipes/inference/model_servers/vllm/inference.py similarity index 100% rename from examples/vllm/inference.py rename to recipes/inference/model_servers/vllm/inference.py diff --git a/demo_apps/Azure_API_example/azure_api_example.ipynb b/recipes/llama_api_providers/Azure_API_example/azure_api_example.ipynb similarity index 100% rename from demo_apps/Azure_API_example/azure_api_example.ipynb rename to recipes/llama_api_providers/Azure_API_example/azure_api_example.ipynb diff --git a/demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb b/recipes/llama_api_providers/OctoAI_API_examples/Getting_to_know_Llama.ipynb similarity index 100% rename from demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb rename to recipes/llama_api_providers/OctoAI_API_examples/Getting_to_know_Llama.ipynb diff --git a/demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb b/recipes/llama_api_providers/OctoAI_API_examples/HelloLlamaCloud.ipynb similarity index 100% rename from demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb rename to recipes/llama_api_providers/OctoAI_API_examples/HelloLlamaCloud.ipynb diff --git a/demo_apps/OctoAI_API_examples/LiveData.ipynb b/recipes/llama_api_providers/OctoAI_API_examples/LiveData.ipynb similarity index 100% rename from demo_apps/OctoAI_API_examples/LiveData.ipynb rename to recipes/llama_api_providers/OctoAI_API_examples/LiveData.ipynb diff --git a/demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb b/recipes/llama_api_providers/OctoAI_API_examples/Llama2_Gradio.ipynb similarity index 100% rename from demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb rename to recipes/llama_api_providers/OctoAI_API_examples/Llama2_Gradio.ipynb diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb b/recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb similarity index 100% rename from demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb rename to recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf b/recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf similarity index 100% rename from demo_apps/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf rename to recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt 
b/recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt similarity index 100% rename from demo_apps/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt rename to recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss b/recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss similarity index 100% rename from demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss rename to recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl b/recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl similarity index 100% rename from demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl rename to recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl diff --git a/demo_apps/OctoAI_API_examples/VideoSummary.ipynb b/recipes/llama_api_providers/OctoAI_API_examples/VideoSummary.ipynb similarity index 100% rename from demo_apps/OctoAI_API_examples/VideoSummary.ipynb rename to recipes/llama_api_providers/OctoAI_API_examples/VideoSummary.ipynb diff --git a/examples/examples_with_aws/Prompt_Engineering_with_Llama_2_On_Amazon_Bedrock.ipynb b/recipes/llama_api_providers/examples_with_aws/Prompt_Engineering_with_Llama_2_On_Amazon_Bedrock.ipynb similarity index 100% rename from examples/examples_with_aws/Prompt_Engineering_with_Llama_2_On_Amazon_Bedrock.ipynb rename to recipes/llama_api_providers/examples_with_aws/Prompt_Engineering_with_Llama_2_On_Amazon_Bedrock.ipynb diff --git a/examples/examples_with_aws/ReAct_Llama_2_Bedrock-WK.ipynb b/recipes/llama_api_providers/examples_with_aws/ReAct_Llama_2_Bedrock-WK.ipynb similarity index 100% rename from examples/examples_with_aws/ReAct_Llama_2_Bedrock-WK.ipynb rename to recipes/llama_api_providers/examples_with_aws/ReAct_Llama_2_Bedrock-WK.ipynb diff --git a/examples/examples_with_aws/getting_started_llama2_on_amazon_bedrock.ipynb b/recipes/llama_api_providers/examples_with_aws/getting_started_llama2_on_amazon_bedrock.ipynb similarity index 100% rename from examples/examples_with_aws/getting_started_llama2_on_amazon_bedrock.ipynb rename to recipes/llama_api_providers/examples_with_aws/getting_started_llama2_on_amazon_bedrock.ipynb diff --git a/examples/Getting_to_know_Llama.ipynb b/recipes/quickstart/Getting_to_know_Llama.ipynb similarity index 100% rename from examples/Getting_to_know_Llama.ipynb rename to recipes/quickstart/Getting_to_know_Llama.ipynb diff --git a/examples/Prompt_Engineering_with_Llama_2.ipynb b/recipes/quickstart/Prompt_Engineering_with_Llama_2.ipynb similarity index 100% rename from examples/Prompt_Engineering_with_Llama_2.ipynb rename to recipes/quickstart/Prompt_Engineering_with_Llama_2.ipynb diff --git a/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb b/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7f16efec2d61c69528e182387ed068caf52f112b --- /dev/null +++ b/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb @@ -0,0 +1,304 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 
"## Running Llama2 on Google Colab using Hugging Face transformers library\n", + "This notebook goes over how you can set up and run Llama2 using Hugging Face transformers library" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Steps at a glance:\n", + "This demo showcases how to run the example with already converted Llama 2 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n", + "\n", + "To use already converted weights, start here:\n", + "1. Request download of model weights from the Llama website\n", + "2. Prepare the script\n", + "3. Run the example\n", + "\n", + "\n", + "Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n", + "1. Request download of model weights from the Llama website\n", + "2. Clone the llama repo and get the weights\n", + "3. Convert the model weights\n", + "4. Prepare the script\n", + "5. Run the example" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using already converted weights" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1. Request download of model weights from the Llama website\n", + "Request download of model weights from the Llama website\n", + "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download modelsâ€. \n", + "\n", + "Fill the required information, select the models “Llama 2 & Llama Chat†and accept the terms & conditions. You will receive a URL in your email in a short time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2. Prepare the script\n", + "\n", + "We will install the Transformers library and Accelerate library for our demo.\n", + "\n", + "The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n", + "The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install transformers\n", + "!pip install accelerate" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "import transformers\n", + "import torch" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 7b chat model `meta-llama/Llama-2-7b-chat-hf`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model = \"meta-llama/Llama-2-7b-chat-hf\"\n", + "tokenizer = AutoTokenizer.from_pretrained(model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pipeline = transformers.pipeline(\n", + "\"text-generation\",\n", + " model=model,\n", + " torch_dtype=torch.float16,\n", + " device_map=\"auto\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3. Run the example\n", + "\n", + "Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n", + "\n", + "Let’s also generate a text sequence based on the input that we provide. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sequences = pipeline(\n", + " 'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n", + " do_sample=True,\n", + " top_k=10,\n", + " num_return_sequences=1,\n", + " eos_token_id=tokenizer.eos_token_id,\n", + " truncation = True,\n", + " max_length=400,\n", + ")\n", + "\n", + "for seq in sequences:\n", + " print(f\"Result: {seq['generated_text']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<br>\n", + "\n", + "### Downloading and converting weights to Hugging Face format" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1. Request download of model weights from the Llama website\n", + "Request download of model weights from the Llama website\n", + "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download modelsâ€. \n", + "\n", + "Fill the required information, select the models “Llama 2 & Llama Chat†and accept the terms & conditions. You will receive a URL in your email in a short time.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2. Clone the llama repo and get the weights\n", + "Git clone the [Llama repo](https://github.com/facebookresearch/llama.git). Enter the URL and get 7B-chat weights. This will download the tokenizer.model, and a directory llama-2-7b-chat with the weights in it.\n", + "\n", + "This example demonstrates a llama2 model with 7B-chat parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3. 
Convert the model weights\n", + "\n", + "* Create a link to the tokenizer:\n", + "Run `ln -h ./tokenizer.model ./llama-2-7b-chat/tokenizer.model` \n", + "\n", + "\n", + "* Convert the model weights to run with Hugging Face:``TRANSFORM=`python -c \"import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')\"``\n", + "\n", + "* Then run: `pip install protobuf && python $TRANSFORM --input_dir ./llama-2-7b-chat --model_size 7B --output_dir ./llama-2-7b-chat-hf`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### 4. Prepare the script\n", + "Import the following necessary modules in your script: \n", + "* `LlamaForCausalLM` is the Llama 2 model class\n", + "* `LlamaTokenizer` prepares your prompt for the model to process\n", + "* `pipeline` is an abstraction to generate model outputs\n", + "* `torch` allows us to use PyTorch and specify the datatype we’d like to use." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "import transformers\n", + "from transformers import LlamaForCausalLM, LlamaTokenizer\n", + "\n", + "\n", + "model_dir = \"./llama-2-7b-chat-hf\"\n", + "model = LlamaForCausalLM.from_pretrained(model_dir)\n", + "\n", + "tokenizer = LlamaTokenizer.from_pretrained(model_dir)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`) among various other options. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pipeline = transformers.pipeline(\n", + " \"text-generation\",\n", + " model=model,\n", + " tokenizer=tokenizer,\n", + " torch_dtype=torch.float16,\n", + " device_map=\"auto\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n", + "\n", + "By changing `max_length`, you can specify how long you’d like the generated response to be. \n", + "Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n", + "\n", + "In your script, add the following to provide input, and information on how to run the pipeline:\n", + "\n", + "\n", + "#### 5. Run the example" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sequences = pipeline(\n", + " 'I have tomatoes, basil and cheese at home. 
What can I cook for dinner?\\n',\n", + " do_sample=True,\n", + " top_k=10,\n", + " num_return_sequences=1,\n", + " eos_token_id=tokenizer.eos_token_id,\n", + " max_length=400,\n", + ")\n", + "for seq in sequences:\n", + " print(f\"{seq['generated_text']}\")\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.8.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_Mac.ipynb b/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_Mac.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..4d98119803e8fa6416f166a3ae9f2b971d4a66a3 --- /dev/null +++ b/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_Mac.ipynb @@ -0,0 +1,219 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running Llama2 on Mac\n", + "This notebook goes over how you can set up and run Llama2 locally on a Mac using llama-cpp-python and the llama-cpp's quantized Llama2 model. It also goes over how to use LangChain to ask Llama general questions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Steps at a glance:\n", + "1. Use CMAKE and install required packages\n", + "2. Request download of model weights from the Llama website\n", + "3. Clone the llama repo and get the weights\n", + "4. Clone the llamacpp repo and quantize the model\n", + "5. Prepare the script\n", + "6. Run the example\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<br>\n", + "\n", + "#### 1. Use CMAKE and install required packages\n", + "\n", + "Type the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1: sets the appropriate build configuration options for the llama-cpp-python package \n", + "#and enables the use of Metal in Mac and forces the use of CMake as the build system.\n", + "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install llama-cpp-python\n", + "\n", + "#pip install llama-cpp-python: installs the llama-cpp-python package and its dependencies:\n", + "!pip install pypdf sentence-transformers chromadb langchain" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If running without a Jupyter notebook, use the command without the `!`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A brief look at the installed libraries:\n", + "- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) a simple Python bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) library\n", + "- pypdf gives us the ability to work with pdfs\n", + "- sentence-transformers for text embeddings\n", + "- chromadb gives us database capabilities \n", + "- langchain provides necessary RAG tools for this demo" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<br>\n", + "\n", + "#### 2. Request download of model weights from the Llama website\n", + "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download modelsâ€. \n", + "Fill the required information, select the models “Llama 2 & Llama Chat†and accept the terms & conditions. 
You will receive a URL in your email in a short time.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<br>\n", + "\n", + "#### 3. Clone the llama repo and get the weights\n", + "Git clone the [Llama repo](https://github.com/facebookresearch/llama.git). Enter the URL and get 13B weights. This example demonstrates a llama2 model with 13B parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<br>\n", + "\n", + "#### 4. Clone the llamacpp repo and quantize the model\n", + "* Git clone the [Llamacpp repo](https://github.com/ggerganov/llama.cpp). \n", + "* Enter the repo:\n", + "`cd llama.cpp`\n", + "* Install requirements:\n", + "`python3 -m pip install -r requirements.txt`\n", + "* Convert the weights:\n", + "`python convert.py <path_to_your_downloaded_llama-2-13b_model>`\n", + "* Run make to generate the 'quantize' method that we will use in the next step\n", + "`make`\n", + "* Quantize the weights:\n", + "`./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### 5. Prepare the script\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# mentions the instance of the Llama model that we will use\n", + "from langchain.llms import LlamaCpp\n", + "\n", + "# defines a chain of operations that can be performed on text input to generate the output using the LLM\n", + "from langchain.chains import LLMChain\n", + "\n", + "# manages callbacks that are triggered at various stages during the execution of an LLMChain\n", + "from langchain.callbacks.manager import CallbackManager\n", + "\n", + "# defines a callback that streams the output of the LLMChain to the console in real-time as it gets generated\n", + "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n", + "\n", + "# allows to define prompt templates that can be used to generate custom inputs for the LLM\n", + "from langchain.prompts import PromptTemplate\n", + "\n", + "\n", + "# Initialize the langchain CallBackManager. This handles callbacks from Langchain and for this example we will use \n", + "# for token-wise streaming so you'll see the answer gets generated token by token when Llama is answering your question\n", + "callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])\n", + "\n", + "# Set up the model\n", + "llm = LlamaCpp(\n", + " model_path=\"<path-to-llama-gguf-file>\",\n", + " temperature=0.0,\n", + " top_p=1,\n", + " n_ctx=6000,\n", + " callback_manager=callback_manager, \n", + " verbose=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 6. Run the example\n", + "\n", + "With the model set up, you are now ready to ask some questions. \n", + "\n", + "Here is an example of the simplest way to ask the model some general questions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run the example\n", + "question = \"who wrote the book Pride and Prejudice?\"\n", + "answer = llm(question)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively, you can use LangChain's `PromptTemplate` for some flexibility in your prompts and questions. 
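One way to use that flexibility, sketched below, is to bake the Llama 2 chat formatting into the template itself so every question is wrapped the same way before it reaches the model. This is only a sketch: it assumes `llm` is the `LlamaCpp` instance created in step 5, and that the `[INST]`/`<<SYS>>` tags follow the Llama 2 chat convention, which you should confirm against the model card for the weights you quantized.

```python
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Hypothetical template that wraps each question in the Llama 2 chat
# convention ([INST] / <<SYS>> tags); verify the exact format against the
# model card for the weights you quantized.
chat_prompt = PromptTemplate.from_template(
    "[INST] <<SYS>>\nYou are a concise, helpful assistant.\n<</SYS>>\n\n{question} [/INST]"
)

# `llm` is the LlamaCpp instance created in step 5 above.
chain = LLMChain(llm=llm, prompt=chat_prompt)
print(chain.run("Who wrote the book Pride and Prejudice?"))
```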
For more information on LangChain's prompt template visit this [link](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = PromptTemplate.from_template(\n", + " \"who wrote {book}?\"\n", + ")\n", + "chain = LLMChain(llm=llm, prompt=prompt)\n", + "answer = chain.run(\"A tale of two cities\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/Purple_Llama_Anyscale.ipynb b/recipes/responsible_ai/Purple_Llama_Anyscale.ipynb similarity index 100% rename from examples/Purple_Llama_Anyscale.ipynb rename to recipes/responsible_ai/Purple_Llama_Anyscale.ipynb diff --git a/examples/Purple_Llama_OctoAI.ipynb b/recipes/responsible_ai/Purple_Llama_OctoAI.ipynb similarity index 100% rename from examples/Purple_Llama_OctoAI.ipynb rename to recipes/responsible_ai/Purple_Llama_OctoAI.ipynb diff --git a/recipes/responsible_ai/README.md b/recipes/responsible_ai/README.md new file mode 100644 index 0000000000000000000000000000000000000000..128dcfd567a436d0d2961f71b3cee984231cd52a --- /dev/null +++ b/recipes/responsible_ai/README.md @@ -0,0 +1,11 @@ +# Llama Guard + +Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard). + +**Note** Please find the right model on HF side [here](https://huggingface.co/meta-llama/LlamaGuard-7b). + +### Running locally +The [llama_guard](llama_guard) folder contains the inference script to run Llama Guard locally. Add test prompts directly to the [inference script](llama_guard/inference.py) before running it. + +### Running on the cloud +The notebooks [Purple_Llama_Anyscale](Purple_Llama_Anyscale.ipynb) & [Purple_Llama_OctoAI](Purple_Llama_OctoAI.ipynb) contain examples for running Llama Guard on cloud hosted endpoints. \ No newline at end of file diff --git a/examples/llama_guard/README.md b/recipes/responsible_ai/llama_guard/README.md similarity index 92% rename from examples/llama_guard/README.md rename to recipes/responsible_ai/llama_guard/README.md index 417bb61c9dd18a95227233f8e353d7270f90c416..97dd31114a4a2c83f4137e9b1ec9bfa330db9b7c 100644 --- a/examples/llama_guard/README.md +++ b/recipes/responsible_ai/llama_guard/README.md @@ -1,6 +1,6 @@ # Llama Guard demo <!-- markdown-link-check-disable --> -Llama Guard is a new experimental model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard). +Llama Guard is a language model that provides input and output guardrails for LLM deployments. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard). This folder contains an example file to run Llama Guard inference directly. 
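For a sense of what the inference script and the cloud notebooks are doing, here is a minimal sketch of calling Llama Guard through Hugging Face `transformers`. It assumes the `meta-llama/LlamaGuard-7b` tokenizer ships with a chat template that renders the safety taxonomy around the conversation, as described on the model card; the `inference.py` script in this folder remains the reference for the repository's own prompt formatting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # gated model; request access on Hugging Face first

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    # The tokenizer's chat template wraps the conversation in the Llama Guard
    # safety taxonomy prompt; see the model card for the exact format.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Expected to print "safe", or "unsafe" followed by the violated category code.
print(moderate([{"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"}]))
```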
diff --git a/examples/llama_guard/__init__.py b/recipes/responsible_ai/llama_guard/__init__.py similarity index 100% rename from examples/llama_guard/__init__.py rename to recipes/responsible_ai/llama_guard/__init__.py diff --git a/examples/llama_guard/inference.py b/recipes/responsible_ai/llama_guard/inference.py similarity index 100% rename from examples/llama_guard/inference.py rename to recipes/responsible_ai/llama_guard/inference.py diff --git a/demo_apps/LiveData.ipynb b/recipes/use_cases/LiveData.ipynb similarity index 100% rename from demo_apps/LiveData.ipynb rename to recipes/use_cases/LiveData.ipynb diff --git a/demo_apps/HelloLlamaCloud.ipynb b/recipes/use_cases/RAG/HelloLlamaCloud.ipynb similarity index 100% rename from demo_apps/HelloLlamaCloud.ipynb rename to recipes/use_cases/RAG/HelloLlamaCloud.ipynb diff --git a/demo_apps/HelloLlamaLocal.ipynb b/recipes/use_cases/RAG/HelloLlamaLocal.ipynb similarity index 100% rename from demo_apps/HelloLlamaLocal.ipynb rename to recipes/use_cases/RAG/HelloLlamaLocal.ipynb diff --git a/demo_apps/llama2.pdf b/recipes/use_cases/RAG/llama2.pdf similarity index 100% rename from demo_apps/llama2.pdf rename to recipes/use_cases/RAG/llama2.pdf diff --git a/recipes/use_cases/README.md b/recipes/use_cases/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8134f74537d46c09dfd4bcb6dc4a0da27c2a994c --- /dev/null +++ b/recipes/use_cases/README.md @@ -0,0 +1,17 @@ +## VideoSummary: Ask Llama2 to Summarize a YouTube Video (using [Replicate](VideoSummary.ipynb) or [OctoAI](../llama_api_providers/OctoAI_API_examples/VideoSummary.ipynb)) +This demo app uses Llama2 to return a text summary of a YouTube video. It shows how to retrieve the caption of a YouTube video and how to ask Llama to summarize the content in four different ways, from the simplest naive way that works for short text to more advanced methods of using LangChain's map_reduce and refine to overcome the 4096 limit of Llama's max input token size. + +## [NBA2023-24](./text2sql/StructuredLlama.ipynb): Ask Llama2 about Structured Data +This demo app shows how to use LangChain and Llama2 to let users ask questions about **structured** data stored in a SQL DB. As the 2023-24 NBA season is around the corner, we use the NBA roster info saved in a SQLite DB to show you how to ask Llama2 questions about your favorite teams or players. + +## LiveData: Ask Llama2 about Live Data (using [Replicate](LiveData.ipynb) or [OctoAI](../llama_api_providers/OctoAI_API_examples/LiveData.ipynb)) +This demo app shows how to perform live data augmented generation tasks with Llama2 and [LlamaIndex](https://github.com/run-llama/llama_index), another leading open-source framework for building LLM apps: it uses the [You.com search API](https://documentation.you.com/quickstart) to get live search result and ask Llama2 about them. + +## [WhatsApp Chatbot](./chatbots/whatsapp_llama/whatsapp_llama2.md): Building a Llama-enabled WhatsApp Chatbot +This step-by-step tutorial shows how to use the [WhatsApp Business API](https://developers.facebook.com/docs/whatsapp/cloud-api/overview) to build a Llama-enabled WhatsApp chatbot. + +## [Messenger Chatbot](./chatbots/messenger_llama/messenger_llama2.md): Building a Llama-enabled Messenger Chatbot +This step-by-step tutorial shows how to use the [Messenger Platform](https://developers.facebook.com/docs/messenger-platform/overview) to build a Llama-enabled Messenger chatbot. 
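The VideoSummary recipe above works around Llama 2's 4096-token input limit with LangChain's map_reduce summarization. A rough sketch of that mechanism is below; it assumes `llm` is any LangChain-compatible Llama 2 wrapper (Replicate, OctoAI, or the local `LlamaCpp` instance shown earlier) and that `transcript` already holds the video captions as a string, so treat it as an illustration rather than the recipe's exact code.

```python
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the long transcript into chunks that fit the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
docs = [Document(page_content=chunk) for chunk in splitter.split_text(transcript)]

# map_reduce summarizes each chunk separately, then summarizes the summaries,
# which is how the recipe gets around the 4096-token input limit.
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(docs))
```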
+ +### RAG Chatbot Example (running [locally](./chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb) or on [OctoAI](../llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)) +A complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data using retrieval augmented generation (RAG). You can run Llama2 locally if you have a good enough GPU or on OctoAI if you follow the note [here](../README.md#octoai_note). \ No newline at end of file diff --git a/demo_apps/VideoSummary.ipynb b/recipes/use_cases/VideoSummary.ipynb similarity index 100% rename from demo_apps/VideoSummary.ipynb rename to recipes/use_cases/VideoSummary.ipynb diff --git a/demo_apps/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb b/recipes/use_cases/chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb similarity index 100% rename from demo_apps/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb rename to recipes/use_cases/chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb diff --git a/demo_apps/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf b/recipes/use_cases/chatbots/RAG_chatbot/data/Llama Getting Started Guide.pdf similarity index 100% rename from demo_apps/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf rename to recipes/use_cases/chatbots/RAG_chatbot/data/Llama Getting Started Guide.pdf diff --git a/demo_apps/RAG_Chatbot_example/requirements.txt b/recipes/use_cases/chatbots/RAG_chatbot/requirements.txt similarity index 100% rename from demo_apps/RAG_Chatbot_example/requirements.txt rename to recipes/use_cases/chatbots/RAG_chatbot/requirements.txt diff --git a/demo_apps/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss b/recipes/use_cases/chatbots/RAG_chatbot/vectorstore/db_faiss/index.faiss similarity index 100% rename from demo_apps/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss rename to recipes/use_cases/chatbots/RAG_chatbot/vectorstore/db_faiss/index.faiss diff --git a/demo_apps/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl b/recipes/use_cases/chatbots/RAG_chatbot/vectorstore/db_faiss/index.pkl similarity index 100% rename from demo_apps/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl rename to recipes/use_cases/chatbots/RAG_chatbot/vectorstore/db_faiss/index.pkl diff --git a/demo_apps/llama_messenger.py b/recipes/use_cases/chatbots/messenger_llama/llama_messenger.py similarity index 100% rename from demo_apps/llama_messenger.py rename to recipes/use_cases/chatbots/messenger_llama/llama_messenger.py diff --git a/demo_apps/messenger_llama2.md b/recipes/use_cases/chatbots/messenger_llama/messenger_llama2.md similarity index 94% rename from demo_apps/messenger_llama2.md rename to recipes/use_cases/chatbots/messenger_llama/messenger_llama2.md index f9952eac465e22759d52ba44b6873a9c56287101..9edee3d709f3b1c39fc7278ad2475fe42fecb1d2 100644 --- a/demo_apps/messenger_llama2.md +++ b/recipes/use_cases/chatbots/messenger_llama/messenger_llama2.md @@ -2,7 +2,7 @@ This step-by-step tutorial shows the complete process of building a Llama-enabled Messenger chatbot. A demo video of using the iOS Messenger app to send a question to a Facebook business page and receive the Llama 2 generated answer is [here](https://drive.google.com/file/d/1B4ijFH4X3jEHZfkGdTPmdsgpUes_RNud/view). -If you're interested in a Llama-enabled WhatsApp chatbot, see [here](whatsapp_llama2.md) for a tutorial. +If you're interested in a Llama-enabled WhatsApp chatbot, see [here](../whatsapp_llama/whatsapp_llama2.md) for a tutorial. 
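The RAG Chatbot Example described above combines document loading, chunking, embeddings, a vector store, and retrieval-augmented question answering. A minimal sketch of that pattern, assuming `llm` is any LangChain-compatible Llama 2 wrapper and using libraries from the quickstart installs (pypdf, sentence-transformers) plus FAISS, is shown here; the notebook's actual choices of splitter, embedding model, and prompts may differ.

```python
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS  # pip install faiss-cpu

# Load and chunk a source document (any PDF works for illustration).
docs = PyPDFLoader("Llama Getting Started Guide.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and index them in a local vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)

# Retrieve the most relevant chunks and let Llama 2 answer from them.
# `llm` is assumed to be any LangChain-compatible Llama 2 wrapper.
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db.as_retriever(search_kwargs={"k": 3})
)
print(qa.run("What is Llama 2 and how was it trained?"))
```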
## Overview @@ -10,13 +10,13 @@ Messenger from Meta is a messaging service that allows a Facebook business page The diagram below shows the components and overall data flow of the Llama 2 enabled Messenger chatbot demo we built, using an Amazon EC2 instance as an example for running the web server. - + ## Getting Started with Messenger Platform 1. A Facebook Page is required to send and receive messages using the Messenger Platform - see [here](https://www.facebook.com/business/help/461775097570076?id=939256796236247) for details about Facebook Pages and how to create a new page. -2. If you have followed the [Llama WhatsApp chatbot tutorial](whatsapp_llama2.md), or if you already have a Meta developer account and a business app, then you can skip this step. Otherwise, you need to first [create a Meta developer account](https://developers.facebook.com/) and then [create a business app](https://developers.facebook.com/docs/development/create-an-app/). +2. If you have followed the [Llama WhatsApp chatbot tutorial](../whatsapp_llama/whatsapp_llama2.md), or if you already have a Meta developer account and a business app, then you can skip this step. Otherwise, you need to first [create a Meta developer account](https://developers.facebook.com/) and then [create a business app](https://developers.facebook.com/docs/development/create-an-app/). 3. Add the Messenger product to your business app by going to your business app's Dashboard, click "Add Product" and then select "Messenger". @@ -24,7 +24,7 @@ The diagram below shows the components and overall data flow of the Llama 2 enab 5. Open Messenger's API Settings, as shown in the screenshot below, then in "1. Configure webhooks", set the Callback URL and Verify Token set up in the previous step, and subscribe all message related fields for "Webhook Fields". Finally, in "2. Generate access tokens", connect your Facebook page (see step 1) and copy your page access token for later use. - + ## Writing Llama 2 Enabled Web App diff --git a/demo_apps/llama_chatbot.py b/recipes/use_cases/chatbots/whatsapp_llama/llama_chatbot.py similarity index 100% rename from demo_apps/llama_chatbot.py rename to recipes/use_cases/chatbots/whatsapp_llama/llama_chatbot.py diff --git a/demo_apps/whatsapp_llama2.md b/recipes/use_cases/chatbots/whatsapp_llama/whatsapp_llama2.md similarity index 98% rename from demo_apps/whatsapp_llama2.md rename to recipes/use_cases/chatbots/whatsapp_llama/whatsapp_llama2.md index ed0bc791d3ad240208b0e43828734af51b1765a0..cc92485f59e034ae2b36106a1cdc5223aca72ce2 100644 --- a/demo_apps/whatsapp_llama2.md +++ b/recipes/use_cases/chatbots/whatsapp_llama/whatsapp_llama2.md @@ -2,7 +2,7 @@ This step-by-step tutorial shows the complete process of building a Llama-enabled WhatsApp chatbot. A demo video of using the iOS WhatsApp to send a question to a test phone number and receive the Llama 2 generated answer is [here](https://drive.google.com/file/d/1fZDaOsvyE1yrNGETV-e0SvL14BYeAI6R/view). -If you're interested in a Llama-enabled Messenger chatbot, see [here](messenger_llama2.md) for a tutorial. +If you're interested in a Llama-enabled Messenger chatbot, see [here](../messenger_llama/messenger_llama2.md) for a tutorial. ## Overview @@ -10,7 +10,7 @@ Businesses of all sizes can use the [WhatsApp Business API](https://developers.f The diagram below shows the components and overall data flow of the Llama 2 enabled WhatsApp chatbot demo we built, using Amazon EC2 instance as an example for running the web server. 
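Both chatbot tutorials share the same web-app shape: a GET handler that echoes Meta's webhook verification challenge, and a POST handler that reads the incoming message, asks Llama 2 for an answer, and posts the reply back through the platform's send API. The stripped-down Flask sketch below illustrates that shape for Messenger; `generate_answer` is a hypothetical placeholder for whatever Llama 2 backend you use, the Graph API version is an assumption, and the repository's `llama_messenger.py` and `llama_chatbot.py` remain the reference implementations.

```python
import os
import requests
from flask import Flask, request

app = Flask(__name__)
PAGE_ACCESS_TOKEN = os.environ["PAGE_ACCESS_TOKEN"]  # from "Generate access tokens"
VERIFY_TOKEN = os.environ["VERIFY_TOKEN"]            # the value configured for the webhook

def generate_answer(question: str) -> str:
    # Hypothetical placeholder: call your Llama 2 backend here
    # (llama-cpp-python, a cloud-hosted endpoint, etc.).
    return "..."

@app.route("/webhook", methods=["GET"])
def verify_webhook():
    # Meta calls this once with a challenge when you configure the Callback URL.
    if request.args.get("hub.verify_token") == VERIFY_TOKEN:
        return request.args.get("hub.challenge", "")
    return "Verification token mismatch", 403

@app.route("/webhook", methods=["POST"])
def handle_message():
    event = request.get_json()
    messaging = event["entry"][0]["messaging"][0]
    sender_id = messaging["sender"]["id"]
    text = messaging["message"]["text"]

    answer = generate_answer(text)
    # Reply through the Messenger Send API (Graph API version is an assumption;
    # use whatever your App Dashboard shows).
    requests.post(
        f"https://graph.facebook.com/v18.0/me/messages?access_token={PAGE_ACCESS_TOKEN}",
        json={"recipient": {"id": sender_id}, "message": {"text": answer}},
        timeout=30,
    )
    return "ok", 200
```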
- + ## Getting Started with WhatsApp Business Cloud API @@ -25,7 +25,7 @@ For the last step, you need to further follow the [Sample Callback URL for Webho Now open the [Meta for Develops Apps](https://developers.facebook.com/apps/) page and select the WhatsApp business app and you should be able to copy the curl command (as shown in the App Dashboard - WhatsApp - API Setup - Step 2 below) and run the command on a Terminal to send a test message to your WhatsApp. - + Note down the "Temporary access token", "Phone number ID", and "a recipient phone number" in the API Setup page above, which will be used later. diff --git a/demo_apps/StructuredLlama.ipynb b/recipes/use_cases/text2sql/StructuredLlama.ipynb similarity index 100% rename from demo_apps/StructuredLlama.ipynb rename to recipes/use_cases/text2sql/StructuredLlama.ipynb diff --git a/demo_apps/csv2db.py b/recipes/use_cases/text2sql/csv2db.py similarity index 100% rename from demo_apps/csv2db.py rename to recipes/use_cases/text2sql/csv2db.py diff --git a/demo_apps/nba.txt b/recipes/use_cases/text2sql/nba.txt similarity index 100% rename from demo_apps/nba.txt rename to recipes/use_cases/text2sql/nba.txt diff --git a/demo_apps/txt2csv.py b/recipes/use_cases/text2sql/txt2csv.py similarity index 100% rename from demo_apps/txt2csv.py rename to recipes/use_cases/text2sql/txt2csv.py diff --git a/scripts/spellcheck_conf/wordlist.txt b/scripts/spellcheck_conf/wordlist.txt index 2254328d1fa81469611f5eb82a6edbf9332cb302..6f5e59ea24acec684fb84fdf00f62c11dc2403af 100644 --- a/scripts/spellcheck_conf/wordlist.txt +++ b/scripts/spellcheck_conf/wordlist.txt @@ -1228,6 +1228,7 @@ hyperparameters jsonl VRAM HuggingFace +huggingface llamaguard LEVELs AugmentationConfigs @@ -1254,7 +1255,11 @@ EleutherAI CodeLlama LlamaGuard OctoAI +octoai OctoAI's PurpleLlama Youtube wandb +multigpu +sql +scalable \ No newline at end of file diff --git a/examples/hf_llama_conversion/README.md b/src/llama_recipes/utils/hf_llama_conversion/README.md similarity index 100% rename from examples/hf_llama_conversion/README.md rename to src/llama_recipes/utils/hf_llama_conversion/README.md diff --git a/examples/hf_llama_conversion/compare_llama_weights.py b/src/llama_recipes/utils/hf_llama_conversion/compare_llama_weights.py similarity index 100% rename from examples/hf_llama_conversion/compare_llama_weights.py rename to src/llama_recipes/utils/hf_llama_conversion/compare_llama_weights.py diff --git a/examples/plot_metrics.py b/src/llama_recipes/utils/plot_metrics.py similarity index 100% rename from examples/plot_metrics.py rename to src/llama_recipes/utils/plot_metrics.py
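The curl command shown in the WhatsApp API Setup page translates directly into a few lines of Python, which is what the web server ultimately needs in order to send Llama's answers back to the user. A hedged sketch, assuming the temporary access token, phone number ID, and recipient phone number noted in that step (the Graph API version and the plain-text payload are assumptions; match whatever the dashboard's curl command shows):

```python
import requests

ACCESS_TOKEN = "<temporary-access-token>"   # from the API Setup page
PHONE_NUMBER_ID = "<phone-number-id>"       # from the API Setup page
RECIPIENT = "<recipient-phone-number>"      # the test recipient you registered

# Rough Python equivalent of the dashboard's curl command for sending a
# WhatsApp Cloud API message (Graph API version is an assumption).
response = requests.post(
    f"https://graph.facebook.com/v18.0/{PHONE_NUMBER_ID}/messages",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "messaging_product": "whatsapp",
        "to": RECIPIENT,
        "type": "text",
        "text": {"body": "Hello from Llama 2!"},
    },
    timeout=30,
)
print(response.status_code, response.json())
```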