Add README for quickstart + update to codellama url (#578)

5be3d4a1 · Suraj Subramanian · GitHub · f01bbe23 · 02a03865 · 5be3d4a1
Unverified Commit 5be3d4a1 authored 8 months ago by Suraj Subramanian Committed by GitHub 8 months ago
--- a/recipes/quickstart/README.md
+++ b/recipes/quickstart/README.md
+## Llama-Recipes Quickstart
+
+If you are new to developing with Meta Llama models, this is where you should start. This folder contains introductory-level notebooks across different techniques relating to Meta Llama.
+
+* The [](./Running_Llama3_Anywhere/) notebooks demonstrate how to run Llama inference across Linux, Mac and Windows platforms using the appropriate tooling.
+* The [](./Prompt_Engineering_with_Llama_3.ipynb) notebook showcases the various ways to elicit appropriate outputs from Llama. Take this notebook for a spin to get a feel for how Llama responds to different inputs and generation parameters.
+* The [](./inference/) folder contains scripts to deploy Llama for inference on server and mobile. See also [](../3p_integration/vllm/) and [](../3p_integration/tgi/) for hosting Llama on open-source model servers.
+* The [](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama 3.
+* The [](./finetuning/) folder contains resources to help you finetune Llama 3 on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-recipes finetuning code found in [](../../src/llama_recipes/finetuning.py) which supports these features:
+
+| Feature                                        |   |
+| ---------------------------------------------- | - |
+| HF support for finetuning                      | ✅ |
+| Deferred initialization ( meta init)           | ✅ |
+| HF support for inference                       | ✅ |
+| Low CPU mode for multi GPU                     | ✅ |
+| Mixed precision                                | ✅ |
+| Single node quantization                       | ✅ |
+| Flash attention                                | ✅ |
+| PEFT                                           | ✅ |
+| Activation checkpointing FSDP                  | ✅ |
+| Hybrid Sharded Data Parallel (HSDP)            | ✅ |
+| Dataset packing & padding                      | ✅ |
+| BF16 Optimizer ( Pure BF16)                    | ✅ |
+| Profiling & MFU tracking                       | ✅ |
+| Gradient accumulation                          | ✅ |
+| CPU offloading                                 | ✅ |
+| FSDP checkpoint conversion to HF for inference | ✅ |
+| W&B experiment tracker                         | ✅ |
--- a/recipes/quickstart/inference/README.md
+++ b/recipes/quickstart/inference/README.md
+## Quickstart > Inference
+
+This folder contains scripts to get you started with inference on Meta Llama models.
+
+* [](./code_llama/) contains scripts for tasks relating to code generation using CodeLlama
+* [](./local_inference/) contsin scripts to do memory efficient inference on servers and local machines
+* [](./mobile_inference/) has scripts using MLC to serve Llama on Android (h/t to OctoAI for the contribution!)
--- a/recipes/quickstart/inference/code_llama/README.md
+++ b/recipes/quickstart/inference/code_llama/README.md
@@ -4,7 +4,7 @@ Code llama was recently released with three flavors, base-model that support mul

 Find the scripts to run Code Llama, where there are two examples of running code completion and infilling.

-**Note** Please find the right model on HF side [here](https://huggingface.co/codellama). 
+**Note** Please find the right model on HF [here](https://huggingface.co/models?search=meta-llama%20codellama). 

 Make sure to install Transformers from source for now

@@ -36,4 +36,4 @@ To run the 70B Instruct model example run the following (you'll need to enter th
 python code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9

 ```
-You can learn more about the chat prompt template [on HF](https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/facebookresearch/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example. 
+You can learn more about the chat prompt template [on HF](https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/meta-llama/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example. 
--- a/recipes/quickstart/inference/local_inference/README.md
+++ b/recipes/quickstart/inference/local_inference/README.md
@@ -61,7 +61,7 @@ python inference.py --model_name <training_config.output_dir> --peft_model <trai

 ```

-## Loading back FSDP checkpoints
+## Inference with FSDP checkpoints

 In case you have fine-tuned your model with pure FSDP and saved the checkpoints with "SHARDED_STATE_DICT" as shown [here](../../../../src/llama_recipes/configs/fsdp.py), you can use this converter script to convert the FSDP Sharded checkpoints into HuggingFace checkpoints. This enables you to use the inference script normally as mentioned above.
 **To convert the checkpoint use the following command**:

--- a/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb
+++ b/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb