diff --git a/recipes/quickstart/README.md b/recipes/quickstart/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..315ab70a8b630c049a8a063b906d5cefd433c464
--- /dev/null
+++ b/recipes/quickstart/README.md
@@ -0,0 +1,29 @@
+## Llama-Recipes Quickstart
+
+If you are new to developing with Meta Llama models, this is where you should start. This folder contains introductory-level notebooks across different techniques relating to Meta Llama.
+
+* The [Running_Llama3_Anywhere](./Running_Llama3_Anywhere/) notebooks demonstrate how to run Llama inference across Linux, Mac and Windows platforms using the appropriate tooling.
+* The [Prompt_Engineering_with_Llama_3](./prompt_engineering/Prompt_Engineering_with_Llama_3.ipynb) notebook showcases the various ways to elicit appropriate outputs from Llama. Take this notebook for a spin to get a feel for how Llama responds to different inputs and generation parameters.
+* The [inference](./inference/) folder contains scripts to deploy Llama for inference on servers and mobile devices. See also [3p_integration/vllm](../3p_integration/vllm/) and [3p_integration/tgi](../3p_integration/tgi/) for hosting Llama on open-source model servers.
+* The [RAG](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama 3.
+* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama 3 on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-recipes finetuning code found in [finetuning.py](../../src/llama_recipes/finetuning.py), which supports these features:
+
+| Feature                                        | Supported |
+| ---------------------------------------------- | --------- |
+| HF support for finetuning                      | ✅ |
+| Deferred initialization (meta init)            | ✅ |
+| HF support for inference                       | ✅ |
+| Low CPU mode for multi-GPU                     | ✅ |
+| Mixed precision                                | ✅ |
+| Single node quantization                       | ✅ |
+| Flash attention                                | ✅ |
+| PEFT                                           | ✅ |
+| Activation checkpointing FSDP                  | ✅ |
+| Hybrid Sharded Data Parallel (HSDP)            | ✅ |
+| Dataset packing & padding                      | ✅ |
+| BF16 Optimizer (Pure BF16)                     | ✅ |
+| Profiling & MFU tracking                       | ✅ |
+| Gradient accumulation                          | ✅ |
+| CPU offloading                                 | ✅ |
+| FSDP checkpoint conversion to HF for inference | ✅ |
+| W&B experiment tracker                         | ✅ |
diff --git a/recipes/quickstart/inference/README.md b/recipes/quickstart/inference/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..025630dd124689580081977835612affe13a865d
--- /dev/null
+++ b/recipes/quickstart/inference/README.md
@@ -0,0 +1,7 @@
+## Quickstart > Inference
+
+This folder contains scripts to get you started with inference on Meta Llama models.
+
+* [code_llama](./code_llama/) contains scripts for tasks relating to code generation using Code Llama
+* [local_inference](./local_inference/) contains scripts for memory-efficient inference on servers and local machines
+* [mobile_inference](./mobile_inference/) has scripts using MLC to serve Llama on Android (h/t to OctoAI for the contribution!)
diff --git a/recipes/quickstart/inference/code_llama/README.md b/recipes/quickstart/inference/code_llama/README.md
index d5f4bda52e2576d9e6fc33d049129d7c1ec0e54d..ef1be5e83731df0527483695f7c230e7f9acdd82 100644
--- a/recipes/quickstart/inference/code_llama/README.md
+++ b/recipes/quickstart/inference/code_llama/README.md
@@ -4,7 +4,7 @@ Code llama was recently released with three flavors, base-model that support mul
 Find the scripts to run Code Llama, where there are two examples of running code completion and infilling.
 
-**Note** Please find the right model on HF side [here](https://huggingface.co/codellama).
+**Note** Please find the right model on HF [here](https://huggingface.co/models?search=meta-llama%20codellama).
 
 Make sure to install Transformers from source for now
@@ -36,4 +36,4 @@ To run the 70B Instruct model example run the following (you'll need to enter th
 python code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9
 ```
 
-You can learn more about the chat prompt template [on HF](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/facebookresearch/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example.
+You can learn more about the chat prompt template [on HF](https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf#chat-prompt) and in the [original Code Llama repository](https://github.com/meta-llama/codellama/blob/main/README.md#fine-tuned-instruction-models). The HF tokenizer already takes care of the chat template, as shown in this example.
diff --git a/recipes/quickstart/inference/local_inference/README.md b/recipes/quickstart/inference/local_inference/README.md
index acdaf3c0b0a814df0c770eeded0e8f5dff95b344..803a50c85d25b61b3184e91d5418c1e40a7764ea 100644
--- a/recipes/quickstart/inference/local_inference/README.md
+++ b/recipes/quickstart/inference/local_inference/README.md
@@ -61,7 +61,7 @@ python inference.py --model_name <training_config.output_dir> --peft_model <trai
 ```
 
-## Loading back FSDP checkpoints
+## Inference with FSDP checkpoints
 In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../../../../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
 
 **To convert the checkpoint use the following command**:
diff --git a/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb b/recipes/quickstart/prompt_engineering/Prompt_Engineering_with_Llama_3.ipynb
similarity index 100%
rename from recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb
rename to recipes/quickstart/prompt_engineering/Prompt_Engineering_with_Llama_3.ipynb
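
As a companion sketch to the "Inference with FSDP checkpoints" section touched in `local_inference/README.md` above: once an FSDP sharded checkpoint has been converted to HuggingFace format, it loads with the standard `transformers` API. This is only an illustrative sketch, not part of the diff; the checkpoint path is a hypothetical placeholder, and it assumes `transformers`, `torch`, and `accelerate` are installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical output directory produced by the FSDP-to-HF converter script.
converted_ckpt = "path/to/converted_hf_checkpoint"

tokenizer = AutoTokenizer.from_pretrained(converted_ckpt)
model = AutoModelForCausalLM.from_pretrained(
    converted_ckpt,
    torch_dtype=torch.bfloat16,  # match the finetuning precision if known
    device_map="auto",           # requires accelerate; places layers across available devices
)

# Run a quick generation to confirm the converted checkpoint behaves like any HF model.
inputs = tokenizer("Write a haiku about llamas.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The same converted directory can also be passed to `inference.py` via `--model_name`, which is what the local inference README means by using the inference script normally after conversion.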