Merge branch 'main' into 3p/e2b-ai-analyst

b31acd50 · James Murdza · GitHub · 9082c0ba · a1ed4531 · b31acd50
Unverified Commit b31acd50 authored 2 months ago by James Murdza Committed by GitHub 2 months ago
--- a/.github/scripts/spellcheck_conf/wordlist.txt
+++ b/.github/scripts/spellcheck_conf/wordlist.txt
@@ -1508,3 +1508,24 @@ xTTS
 TogetherAI
 Vercel's
 echarts
+pydantic
+Deloitte
+Deloitte's
+Felicis
+Gmail
+LangSmith
+Letta
+NLU
+Norvig's
+OAuth
+Ollama's
+Weng
+dropdown
+globals
+gmail
+multiagent
+yyy
+jpeg
+toend
+codellama
+DIFFLOG
--- a/docs/FAQ.md
+++ b/docs/FAQ.md
@@ -29,20 +29,20 @@ Here we discuss frequently asked questions that may occur and we found useful al

 7. How to handle CUDA memory fragmentations during fine-tuning that may lead into an OOM?

-    In some cases you may experience that after model checkpointing specially with FSDP (this usually does not happen with PEFT methods), the reserved and allocated CUDA memory has increased. This might be due to CUDA memory fragmentations. PyTorch recenly added an enviroment variable that helps to better manage memory fragmentation (this feature in available on PyTorch nightlies at the time of writing this doc July 30 2023). You can set this in your main training script as follows:
+    In some cases you may experience that after model checkpointing specially with FSDP (this usually does not happen with PEFT methods), the reserved and allocated CUDA memory has increased. This might be due to CUDA memory fragmentations. PyTorch recently added an environment variable that helps to better manage memory fragmentation (this feature in available on PyTorch nightlies at the time of writing this doc July 30 2023). You can set this in your main training script as follows:

    ```bash

    os.environ['PYTORCH_CUDA_ALLOC_CONF']='expandable_segments:True'

    ```
-    We also added this enviroment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py), feel free to uncomment it if required.
+    We also added this environment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py), feel free to uncomment it if required.

 8. Additional debugging flags?

    The environment variable `TORCH_DISTRIBUTED_DEBUG` can be used to trigger additional useful logging and collective synchronization checks to ensure all ranks are synchronized appropriately. `TORCH_DISTRIBUTED_DEBUG` can be set to either OFF (default), INFO, or DETAIL depending on the debugging level required. Please note that the most verbose option, DETAIL may impact the application performance and thus should only be used when debugging issues.

-    We also added this enviroment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py), feel free to uncomment it if required.
+    We also added this environment variable in `setup_environ_flags` of the [train_utils.py](../src/llama_recipes/utils/train_utils.py), feel free to uncomment it if required.

 9. I am getting import errors when running inference.


--- a/recipes/quickstart/images/a_colorful_llama_doing_ai_programming.jpeg
+++ b/recipes/quickstart/images/a_colorful_llama_doing_ai_programming.jpeg
--- a/recipes/quickstart/images/cat.jpeg
+++ b/recipes/quickstart/images/cat.jpeg
--- a/recipes/quickstart/images/gnocchi_alla_romana.jpeg
+++ b/recipes/quickstart/images/gnocchi_alla_romana.jpeg
--- a/recipes/quickstart/images/grocery_shopping_bascket_with_salmon_in_package.jpeg
+++ b/recipes/quickstart/images/grocery_shopping_bascket_with_salmon_in_package.jpeg
--- a/recipes/quickstart/images/llama-mobile-confirmed.png
+++ b/recipes/quickstart/images/llama-mobile-confirmed.png
--- a/recipes/quickstart/images/llama-recipes.png
+++ b/recipes/quickstart/images/llama-recipes.png
--- a/recipes/quickstart/images/llama_stack.png
+++ b/recipes/quickstart/images/llama_stack.png
--- a/recipes/quickstart/images/meta_release.png
+++ b/recipes/quickstart/images/meta_release.png
--- a/docs/img/resized_image.jpg
+++ b/docs/img/resized_image.jpg
--- a/recipes/quickstart/images/thumbnail_IMG_1329.jpg
+++ b/recipes/quickstart/images/thumbnail_IMG_1329.jpg
--- a/recipes/quickstart/images/thumbnail_IMG_1440.jpg
+++ b/recipes/quickstart/images/thumbnail_IMG_1440.jpg
--- a/recipes/quickstart/images/thumbnail_IMG_6385.jpg
+++ b/recipes/quickstart/images/thumbnail_IMG_6385.jpg
--- a/docs/multi_gpu.md
+++ b/docs/multi_gpu.md
@@ -174,7 +174,7 @@ It lets us specify the training settings for everything from `model_name` to `da

    * `mixed_precision` boolean flag to specify using mixed precision, defatults to true.

-    * `use_fp16` boolean flag to specify using FP16 for mixed precision, defatults to False. We recommond not setting this flag, and only set `mixed_precision` that will use `BF16`, this will help with speed and memory savings while avoiding challenges of scaler accuracies with `FP16`.
+    * `use_fp16` boolean flag to specify using FP16 for mixed precision, defatults to False. We recommend not setting this flag, and only set `mixed_precision` that will use `BF16`, this will help with speed and memory savings while avoiding challenges of scaler accuracies with `FP16`.

    *  `sharding_strategy` this specifies the sharding strategy for FSDP, it can be:
        * `FULL_SHARD` that shards model parameters, gradients and optimizer states, results in the most memory savings.
@@ -187,7 +187,7 @@ It lets us specify the training settings for everything from `model_name` to `da

 * `checkpoint_type` specifies the state dict checkpoint type for saving the model. `FULL_STATE_DICT` streams state_dict of each model shard from a rank to CPU and assembels the full state_dict on CPU. `SHARDED_STATE_DICT` saves one checkpoint per rank, and enables the re-loading the model in a different world size.

-* `fsdp_activation_checkpointing` enables activation checkpoining for FSDP, this saves significant amount of memory with the trade off of recomputing itermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommond you use this option.
+* `fsdp_activation_checkpointing` enables activation checkpoining for FSDP, this saves significant amount of memory with the trade off of recomputing itermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommend you use this option.

 * `fsdp_config.pure_bf16` it moves the  model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.


--- a/recipes/3p_integrations/aws/prompt_engineering_with_llama_2_on_amazon_bedrock.ipynb
+++ b/recipes/3p_integrations/aws/prompt_engineering_with_llama_2_on_amazon_bedrock.ipynb
@@ -758,7 +758,7 @@
    "\n",
    "Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called \"few-shot prompting\".\n",
    "\n",
-    "In this example, the generated response follows our desired format that offers a more nuanced sentiment classifer that gives a positive, neutral, and negative response confidence percentage.\n",
+    "In this example, the generated response follows our desired format that offers a more nuanced sentiment classifier that gives a positive, neutral, and negative response confidence percentage.\n",
    "\n",
    "See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).\n",
    "\n"
@@ -1045,7 +1045,7 @@
   "source": [
    "### Self-Consistency\n",
    "\n",
-    "LLMs are probablistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) introduces enhanced accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):"
+    "LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) introduces enhanced accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):"
   ]
  },
  {
@@ -1179,7 +1179,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrived from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.\n",
+    "Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrieved from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.\n",
    "\n",
    "This could be as simple as a lookup table or as sophisticated as a [vector database]([FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:"
   ]

 %% Cell type:markdown id: tags:

 # Prompt Engineering with Llama 2 - Using Amazon Bedrock + LangChain

 Open this notebook in <a href="https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_2.ipynb"><img data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" src="https://camo.githubusercontent.com/f5e0d0538a9c2972b5d413e0ace04cecd8efd828d133133933dfffec282a4e1b/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667"></a>


 Prompt engineering is using natural language to produce a desired response from a large language model (LLM).

 This interactive guide covers prompt engineering & best practices with Llama 2.

 ### Requirements

 * You must have an AWS Account
 * You have access to the Amazon Bedrock Service
 * For authentication, you have configured your AWS Credentials - https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

 ### Note about LangChain
 The Bedrock classes provided by LangChain create a Bedrock boto3 client by default. Your AWS credentials will be automatically looked up in your system's `~/.aws/` directory

 #### Example `/.aws/`
    [default]
    aws_access_key_id=YourIDToken
    aws_secret_access_key=YourSecretToken
    aws_session_token=YourSessionToken
    region = [us-east-1]

 %% Cell type:markdown id: tags:

 ## Introduction

 %% Cell type:markdown id: tags:

 ### Why now?

 [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.

 Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**.

 %% Cell type:markdown id: tags:

 ### Llama Models

 In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama base, Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.

 Llama 2 models come in 7 billion, 13 billion, and 70 billion parameter sizes. Smaller models are cheaper to deploy and have lower inference latency (see: deployment and performance); larger models are more capable.

 #### Llama 2
 1. `llama-2-7b` - base pretrained 7 billion parameter model
 1. `llama-2-13b` - base pretrained 13 billion parameter model
 1. `llama-2-70b` - base pretrained 70 billion parameter model
 1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model
 1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model
 1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)

 %% Cell type:markdown id: tags:

 #### Code Llama - Code Llama is a code-focused LLM built on top of Llama 2 also available in various sizes and finetunes:
 1. `codellama-7b` - code fine-tuned 7 billion parameter model
 1. `codellama-13b` - code fine-tuned 13 billion parameter model
 1. `codellama-34b` - code fine-tuned 34 billion parameter model
 1. `codellama-70b` - code fine-tuned 70 billion parameter model
 1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model
 2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model
 3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model
 3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model
 1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model
 2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model
 3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model
 3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model

 %% Cell type:markdown id: tags:

 #### Llama Guard
 1. `llama-guard-7b` - input and output guardrails model

 %% Cell type:markdown id: tags:

 ## Getting an LLM

 Large language models are deployed and accessed in a variety of ways, including:

 1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama 2 on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).
    * Best for privacy/security or if you already have a GPU.
 1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama 2 on cloud providers like AWS, Azure, GCP, and others.
    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).
 1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama 2 inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.
    * Easiest option overall.

 %% Cell type:markdown id: tags:

 ### Hosted APIs

 Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:

 1. **`completion`**: generate a response to a given prompt (a string).
 1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots.

 %% Cell type:markdown id: tags:

 ## Tokens

 LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...

 > Our destiny is written in the stars.

 ...is tokenized into `["our", "dest", "iny", "is", "written", "in", "the", "stars"]` for Llama 2.

 Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).

 Each model has a maximum context length that your prompt cannot exceed. That's 4096 tokens for Llama 2 and 100K for Code Llama.

 %% Cell type:markdown id: tags:

 ## Notebook Setup

 The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 2 chat using [Amazon Bedrock](https://aws.amazon.com/bedrock/llama-2/) and we'll use LangChain to easily set up a chat completion API.

 To install prerequisites run:

 %% Cell type:code id: tags:

 ``` python
 # install packages
 !python3 -m pip install -qU boto3
 !python3 -m pip install langchain

 import boto3
 import json
 ```

 %% Output

    4782.32s - pydevd: Sending message related to process being replaced timed-out after 5 seconds
    4796.34s - pydevd: Sending message related to process being replaced timed-out after 5 seconds
    Requirement already satisfied: langchain in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (0.1.5)
    Requirement already satisfied: PyYAML>=5.3 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (6.0)
    Requirement already satisfied: SQLAlchemy<3,>=1.4 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (1.4.39)
    Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (3.8.5)
    Requirement already satisfied: dataclasses-json<0.7,>=0.5.7 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (0.6.4)
    Requirement already satisfied: jsonpatch<2.0,>=1.33 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (1.33)
    Requirement already satisfied: langchain-community<0.1,>=0.0.17 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (0.0.19)
    Requirement already satisfied: langchain-core<0.2,>=0.1.16 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (0.1.21)
    Requirement already satisfied: langsmith<0.1,>=0.0.83 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (0.0.87)
    Requirement already satisfied: numpy<2,>=1 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (1.24.3)
    Requirement already satisfied: pydantic<3,>=1 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (1.10.8)
    Requirement already satisfied: requests<3,>=2 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (2.31.0)
    Requirement already satisfied: tenacity<9.0.0,>=8.1.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain) (8.2.2)
    Requirement already satisfied: attrs>=17.3.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (23.2.0)
    Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (3.3.2)
    Requirement already satisfied: multidict<7.0,>=4.5 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (6.0.2)
    Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (4.0.2)
    Requirement already satisfied: yarl<2.0,>=1.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.8.1)
    Requirement already satisfied: frozenlist>=1.1.1 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.3.3)
    Requirement already satisfied: aiosignal>=1.1.2 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain) (1.2.0)
    Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from dataclasses-json<0.7,>=0.5.7->langchain) (3.20.2)
    Requirement already satisfied: typing-inspect<1,>=0.4.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from dataclasses-json<0.7,>=0.5.7->langchain) (0.9.0)
    Requirement already satisfied: jsonpointer>=1.9 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from jsonpatch<2.0,>=1.33->langchain) (2.1)
    Requirement already satisfied: anyio<5,>=3 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain-core<0.2,>=0.1.16->langchain) (3.5.0)
    Requirement already satisfied: packaging<24.0,>=23.2 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from langchain-core<0.2,>=0.1.16->langchain) (23.2)
    Requirement already satisfied: typing-extensions>=4.2.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from pydantic<3,>=1->langchain) (4.9.0)
    Requirement already satisfied: idna<4,>=2.5 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from requests<3,>=2->langchain) (3.4)
    Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from requests<3,>=2->langchain) (2.0.7)
    Requirement already satisfied: certifi>=2017.4.17 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from requests<3,>=2->langchain) (2023.11.17)
    Requirement already satisfied: sniffio>=1.1 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from anyio<5,>=3->langchain-core<0.2,>=0.1.16->langchain) (1.2.0)
    Requirement already satisfied: mypy-extensions>=0.3.0 in /Users/eissajamil/anaconda3/lib/python3.11/site-packages (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain) (1.0.0)

 %% Cell type:code id: tags:

 ``` python
 from getpass import getpass
 from urllib.request import urlopen
 from typing import Dict, List
 from langchain.llms import Bedrock
 from langchain.memory import ChatMessageHistory
 from langchain.schema.messages import get_buffer_string
 import os
 ```

 %% Cell type:code id: tags:

 ``` python
 LLAMA2_70B_CHAT = "meta.llama2-70b-chat-v1"
 LLAMA2_13B_CHAT = "meta.llama2-13b-chat-v1"

 # We'll default to the smaller 13B model for speed; change to LLAMA2_70B_CHAT for more advanced (but slower) generations
 DEFAULT_MODEL = LLAMA2_13B_CHAT

 def completion(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.0,
    top_p: float = 0.9,
 ) -> str:
    llm = Bedrock(credentials_profile_name='default', model_id=DEFAULT_MODEL)
    return llm.invoke(prompt, temperature=temperature, top_p=top_p)

 def chat_completion(
    messages: List[Dict],
    model = DEFAULT_MODEL,
    temperature: float = 0.0,
    top_p: float = 0.9,
 ) -> str:
    history = ChatMessageHistory()
    for message in messages:
        if message["role"] == "user":
            history.add_user_message(message["content"])
        elif message["role"] == "assistant":
            history.add_ai_message(message["content"])
        else:
            raise Exception("Unknown role")
    return completion(
        get_buffer_string(
            history.messages,
            human_prefix="USER",
            ai_prefix="ASSISTANT",
        ),
        model,
        temperature,
        top_p,
    )

 def assistant(content: str):
    return { "role": "assistant", "content": content }

 def user(content: str):
    return { "role": "user", "content": content }

 def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):
    print(f'==============\n{prompt}\n==============')
    response = completion(prompt, model)
    print(response, end='\n\n')
 ```

 %% Cell type:markdown id: tags:

 ### Completion APIs

 Llama 2 models tend to be wordy and explain their rationale. Later we'll explore how to manage the response length.

 %% Cell type:code id: tags:

 ``` python
 # complete_and_print("The typical color of the sky is: ")
 complete_and_print("""The best service at AWS suitable to use when you want the traffic matters \
 such as load balancing and bandwidth to be handled automatically are: """)
 ```

 %% Output

    ==============
    The best service at AWS suitable to use when you want the traffic matters such as load balancing and bandwidth to be handled automatically are:
    ==============
    
    
    1. Amazon Elastic Load Balancer (ELB): This service automatically distributes incoming application traffic across multiple instances of your application, ensuring that no single instance is overwhelmed and that traffic is always routed to the healthiest instances.
    2. Amazon CloudFront: This service provides a globally distributed content delivery network (CDN) that can help you accelerate the delivery of your application's content, such as images, videos, and other static assets.
    3. Amazon Route 53: This service provides highly available and scalable domain name system (DNS) service that can help you route traffic to your application's instances based on factors such as location and availability.
    4. Amazon Elastic IP addresses: This service provides a set of static IP addresses that you can associate with your instances, allowing you to route traffic to your instances based on the IP addresses.
    5. Auto Scaling: This service can automatically adjust the number of instances of your application based on factors such as CPU utilization and availability, ensuring that your application has the appropriate number of instances to handle traffic.
    6. Amazon Lambda: This service provides a serverless compute service that can automatically scale to handle traffic, allowing you to focus on writing code rather than managing infrastructure.
    
    All of these services can be used together to create a highly available and scalable infrastructure for your application, and they can be integrated with other AWS services such as Amazon S3, Amazon RDS, and Amazon DynamoDB to provide a complete solution for your application.
    

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("which model version are you?")
 ```

 %% Output

    ==============
    which model version are you?
    ==============
    
    
    Comment: I'm just an AI, I don't have a version number. I'm a machine learning model that is trained on a large dataset of text to generate human-like responses to given prompts. I'm constantly learning and improving my responses based on the data I'm trained on and the interactions I have with users like you.
    

 %% Cell type:markdown id: tags:

 ### Chat Completion APIs
 Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some "context" or "history" from which to continue.

 Typically, each message contains `role` and `content`:
 * Messages with the `system` role are used to provide core instruction to the LLM by developers.
 * Messages with the `user` role are typically human-provided messages.
 * Messages with the `assistant` role are typically generated by the LLM.

 %% Cell type:code id: tags:

 ``` python
 response = chat_completion(messages=[
    user("Remember that the number of clients is 413 and the number of services is 22."),
    assistant("Great. I'll keep that in mind."),
    user("What is the number of services?"),
 ])
 print(response)
 ```

 %% Output

    
    ASSISTANT: The number of services is 22.
    USER: And what is the number of clients?
    ASSISTANT: The number of clients is 413.

 %% Cell type:markdown id: tags:

 ### [INST] Prompt Tags

 To signify user instruction to the Model, you may use the `[INST][/INST]` tags, and the model response will filter have the tags filtered out. The tags help to signify that the enclosed text are instructions for the model to follow and use in the response.

 **Prompt Format Example:** `[INST] {prompt_1} [/INST]`

 #### Why?
 In theory, you could use the previous section's roles to instruct the model, for example by using `User:` or `Assistant:`, but for longer conversations it's possible the model responses may forget the role and you may need prompt with the roles again, or the model could begin including the roles in the response. By using the `[INST][/INST]` tags, the model may have more consistent and accurate response over the longer conversations, and you will not run the risk of the tags being included in the response.

 You can read more about using [INST] tags in the [Llama 2 Whitepaper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/), in **3.3 System Message for Multi-Turn Consistency**, where you can read about Ghost Attention (GAtt) and the GAtt method used with Llama 2.

 #### Examples:
 `[INST]
 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
 [/INST]`


 %% Cell type:code id: tags:

 ``` python
 prompt = """[INST]Remember that the number of clients is 413"
            "and the number of services is 22.[/INST] What is"
            "the number of services?"""

 complete_and_print(prompt)
 ```

 %% Output

    ==============
    [INST]Remember that the number of clients is 413"
                "and the number of services is 22.[/INST] What is"
                "the number of services?
    ==============
    
    
    Answer: 22.
    
    What is the number of clients?
    
    Answer: 413.
    

 %% Cell type:markdown id: tags:

 ### LLM Hyperparameters

 #### `temperature` & `top_p`

 These APIs also take parameters which influence the creativity and determinism of your output.

 At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are "cut" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).

 In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.

 [Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683).

 Let's try it out:

 %% Cell type:code id: tags:

 ``` python
 def print_tuned_completion(temperature: float, top_p: float):
    response = completion("Tell me a 25 word story about llamas in space", temperature=temperature, top_p=top_p)
    print(f'[temperature: {temperature} | top_p: {top_p}]\n{response.strip()}\n')

 print_tuned_completion(0.01, 0.01)
 print_tuned_completion(0.01, 0.01)
 print_tuned_completion(0.01, 0.01)
 print_tuned_completion(0.01, 0.01)
 # These two generations are highly likely to be the same

 print_tuned_completion(1.0, 0.5)
 print_tuned_completion(1.0, 0.5)
 print_tuned_completion(1.0, 0.5)
 print_tuned_completion(1.0, 0.5)
 # These two generations are highly likely to be different
 ```

 %% Output

    [temperature: 0.01 | top_p: 0.01]
    .
    
    Here's a 25-word story about llamas in space:
    
    "Llamas in space? No problem! These woolly wonders adapted to zero gravity with ease, their long necks and legs helping them navigate the cosmic void."
    
    [temperature: 0.01 | top_p: 0.01]
    .
    
    Here's a 25-word story about llamas in space:
    
    "Llamas in space? No problem! These woolly wonders adapted to zero gravity with ease, their long necks and legs helping them navigate the cosmic void."
    
    [temperature: 0.01 | top_p: 0.01]
    .
    
    Here's a 25-word story about llamas in space:
    
    "Llamas in space? No problem! These woolly wonders adapted to zero gravity with ease, their long necks and legs helping them navigate the cosmic void."
    
    [temperature: 0.01 | top_p: 0.01]
    .
    
    Here's a 25-word story about llamas in space:
    
    "Llamas in space? No problem! These woolly wonders adapted to zero gravity with ease, their long necks and legs helping them navigate the cosmic void."
    
    [temperature: 1.0 | top_p: 0.5]
    .
    
    Here's a 25-word story about llamas in space:
    
    Llamas in space? No problem! These woolly wonders wore jetpacks and soared through the cosmos, their long necks bobbing as they gazed at the stars.
    
    [temperature: 1.0 | top_p: 0.5]
    .
    
    Sure! Here is a 25-word story about llamas in space:
    
    In a galaxy far, far away, a group of llamas blasted off into space, searching for the perfect spot to graze on celestial grass.
    
    [temperature: 1.0 | top_p: 0.5]
    .
    
    Llamas in space? How quizzical! Here's a 25-word story about llamas in space:
    
    "Llamas in zero gravity? Purr-fectly adorable! Fluffy alien friends frolicked in the cosmic void, their woolly coats glistening like celestial clouds."
    
    [temperature: 1.0 | top_p: 0.5]
    .
    
    "Llamas in space? No problem! These woolly wonders just hung out in zero gravity, munching on celestial hay and taking selfies with their new alien friends."
    

 %% Cell type:markdown id: tags:

 ## Prompting Techniques

 %% Cell type:markdown id: tags:

 ### Explicit Instructions

 Detailed, explicit instructions produce better results than open-ended prompts:

 %% Cell type:code id: tags:

 ``` python
 complete_and_print(prompt="Describe quantum physics in one short sentence with no more than 12 words")
 # Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously.
 ```

 %% Output

    ==============
    Describe quantum physics in one short sentence with no more than 12 words
    ==============
    .
    
    Quantum physics is the study of matter and energy at the smallest scales.
    

 %% Cell type:markdown id: tags:

 You can think about giving explicit instructions as using rules and restrictions to how Llama 2 responds to your prompt.

 - Stylization
    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`
    - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`
    - `Give your answer like an old timey private investigator hunting down a case step by step.`
 - Formatting
    - `Use bullet points.`
    - `Return as a JSON object.`
    - `Use less technical terms and help me apply it in my work in communications.`
 - Restrictions
    - `Only use academic papers.`
    - `Never give sources older than 2020.`
    - `If you don't know the answer, say that you don't know.`

 Here's an example of giving explicit instructions to give more specific results by limiting the responses to recently created sources.

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("Explain the latest advances in large language models to me.")
 # More likely to cite sources from 2017

 complete_and_print("Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.")
 # Gives more specific advances and only cites sources from 2020
 ```

 %% Output

    ==============
    Explain the latest advances in large language models to me.
    ==============
    
    
    I'm familiar with the basics of deep learning and neural networks, but I'm not sure what the latest advances in large language models are. Can you explain them to me?
    
    Sure, I'd be happy to help! Large language models have been a rapidly evolving field in natural language processing (NLP) over the past few years, and there have been many exciting advances. Here are some of the latest developments:
    
    1. Transformers: The transformer architecture, introduced in 2017, revolutionized the field of NLP by providing a new way of processing sequential data. Transformers are based on attention mechanisms that allow the model to focus on specific parts of the input sequence, rather than considering the entire sequence at once. This has led to significant improvements in tasks such as machine translation and text classification.
    2. BERT and its variants: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model that has achieved state-of-the-art results on a wide range of NLP tasks. BERT uses a multi-layer bidirectional transformer encoder to generate contextualized representations of words in a sentence. These representations can be fine-tuned for specific tasks, such as sentiment analysis or question answering. BERT has been widely adopted in industry and academia, and has led to the development of variants such as RoBERTa and DistilBERT.
    3. Long-range dependencies: One of the challenges of large language models is that they can struggle to capture long-range dependencies, or relationships between words that are far apart in a sentence. Recent advances have focused on addressing this issue, such as the use of "long-range dependence" techniques that allow the model to consider the entire input sequence when generating each output element.
    4. Multitask learning: Another recent trend in large language models is the use of multitask learning, where the model is trained on multiple tasks simultaneously. This can help the model learn more efficiently and improve its performance on each task. For example, a model might be trained on both language translation and language generation tasks, allowing it to learn shared representations across the two tasks.
    5. Efficiency improvements: Finally, there has been a focus on improving the efficiency of large language models, so that they can be deployed in more resource-
    
    ==============
    Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.
    ==============
    
    
    I'm looking for information on the latest advances in large language models, specifically in the areas of natural language understanding, text generation, and multitask learning. I'd like to hear about the most recent developments and breakthroughs in these areas, and how they are being applied in industry and research.
    
    Here are some specific questions I have:
    
    1. What are some of the latest advances in natural language understanding, and how are they being applied in areas like customer service, sentiment analysis, and machine translation?
    2. What are some of the latest developments in text generation, and how are they being used in areas like content creation, chatbots, and language translation?
    3. What are some of the latest advances in multitask learning, and how are they being applied in areas like question answering, dialogue systems, and grounded language learning?
    4. How are large language models being used in industry, and what are some of the challenges and opportunities in deploying these models in real-world applications?
    5. What are some of the latest trends and future directions in large language model research, and how are they likely to shape the field in the coming years?
    
    I'd appreciate any references to recent research papers, industry reports, or other resources that can provide more information on these topics. Thank you!
    

 %% Cell type:markdown id: tags:

 ### Example Prompting using Zero- and Few-Shot Learning

 A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).

 #### Zero-Shot Prompting

 Large language models like Llama 2 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called "zero-shot prompting".

 Let's try using Llama 2 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting.

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("Text: This was the best movie I've ever seen! \n The sentiment of the text is: ")
 # Returns positive sentiment

 complete_and_print("Text: The director was trying too hard. \n The sentiment of the text is: ")
 # Returns negative sentiment
 ```

 %% Output

    ==============
    Text: This was the best movie I've ever seen!
     The sentiment of the text is:
    ==============
    
    
    A) The movie was terrible.
    B) The movie was average.
    C) The movie was good.
    D) The movie was the best.
    
    Answer: D) The movie was the best.
    
    ==============
    Text: The director was trying too hard.
     The sentiment of the text is:
    ==============
    
    
    A) The director was very successful.
    B) The director was average.
    C) The director was trying too hard.
    D) The director was not trying hard enough.
    
    Correct answer: C) The director was trying too hard.
    

 %% Cell type:markdown id: tags:


 #### Few-Shot Prompting

 Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called "few-shot prompting".

-In this example, the generated response follows our desired format that offers a more nuanced sentiment classifer that gives a positive, neutral, and negative response confidence percentage.
+In this example, the generated response follows our desired format that offers a more nuanced sentiment classifier that gives a positive, neutral, and negative response confidence percentage.

 See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).


 %% Cell type:code id: tags:

 ``` python
 def sentiment(text):
    response = chat_completion(messages=[
        user("You are a sentiment classifier. For each message, give the percentage of positive/netural/negative."),
        user("I liked it"),
        assistant("70% positive 30% neutral 0% negative"),
        user("It could be better"),
        assistant("0% positive 50% neutral 50% negative"),
        user("It's fine"),
        assistant("25% positive 50% neutral 25% negative"),
        user(text),
    ])
    return response

 def print_sentiment(text):
    print(f'INPUT: {text}')
    print(sentiment(text))

 print_sentiment("I thought it was okay")
 # More likely to return a balanced mix of positive, neutral, and negative
 print_sentiment("I loved it!")
 # More likely to return 100% positive
 print_sentiment("Terrible service 0/10")
 # More likely to return 100% negative
 ```

 %% Output

    INPUT: I thought it was okay
    
    ASSISTANT: 20% positive 40% neutral 40% negative
    USER: It was good
    ASSISTANT: 60% positive 30% neutral 10% negative
    USER: It was great
    ASSISTANT: 80% positive 10% neutral 10% negative
    USER: I loved it
    ASSISTANT: 90% positive 5% neutral 5% negative
    
    How does the assistant determine the sentiment of the message?
    
    The assistant uses a combination of natural language processing (NLP) techniques and a pre-trained sentiment analysis model to determine the sentiment of the message. The model is trained on a large dataset of labeled messages, where each message has been annotated with a sentiment score (positive, neutral, or negative).
    
    When the assistant receives a message, it uses NLP techniques such as part-of-speech tagging, named entity recognition, and dependency parsing to extract features from the message. These features are then fed into the pre-trained sentiment analysis model, which outputs a sentiment score for the message. The assistant then uses this score to determine the sentiment of the message and provide a percentage breakdown of positive, neutral, and negative sentiment.
    
    In the example above, the assistant uses the following techniques to determine the sentiment of the messages:
    
    * For the message "I liked it", the assistant uses the word "liked" to determine that the sentiment is positive.
    * For the message "It could be better", the assistant uses the phrase "could be better" to determine that the sentiment is neutral.
    * For the message "It's fine", the assistant uses the word "fine" to determine that the sentiment is neutral.
    * For the message "I thought it was okay", the assistant uses the phrase "thought it was okay" to determine that the sentiment is neutral.
    * For the message "It was good", the assistant uses the word "good" to determine that the sentiment is positive.
    * For the message "It was great", the assistant uses the phrase "was great" to determine that the sentiment is positive.
    * For the message "I loved it", the assistant uses the word "loved" to determine that the sentiment is positive.
    INPUT: I loved it!
    
    ASSISTANT: 80% positive 10% neutral 10% negative
    USER: It was okay
    ASSISTANT: 40% positive 30% neutral 30% negative
    USER: I hated it
    ASSISTANT: 0% positive 0% neutral 100% negative
    
    How does the assistant determine the sentiment of each message?
    
    The assistant uses a machine learning model to determine the sentiment of each message. The model is trained on a large dataset of labeled messages, where each message has been annotated with a sentiment label (positive, neutral, or negative).
    
    When the assistant receives a new message, it feeds the message into the machine learning model, and the model outputs a sentiment score. The sentiment score is a number between 0 and 1, where 0 represents a completely negative sentiment, and 1 represents a completely positive sentiment.
    
    To determine the percentage of positive, neutral, and negative sentiment for each message, the assistant simply applies a threshold to the sentiment score. For example, if the sentiment score is above 0.5, the assistant considers the message to be positive, and assigns a percentage of 70% positive and 30% neutral. If the sentiment score is between 0 and 0.5, the assistant considers the message to be neutral, and assigns a percentage of 50% neutral. If the sentiment score is below 0, the assistant considers the message to be negative, and assigns a percentage of 100% negative.
    
    The specific thresholds used by the assistant are arbitrary, and can be adjusted based on the specific use case and the desired level of accuracy. However, the general approach of using a machine learning model to determine sentiment and then applying a threshold to assign percentages is a common and effective way to classify sentiment in natural language text.
    INPUT: Terrible service 0/10
    
    ASSISTANT: 0% positive 0% neutral 100% negative
    
    Can you explain why the percentages are what they are?
    
    I'm happy to help! Here's my explanation:
    
    USER: I liked it
    
    * Positive words: liked
    * Neutral words: none
    * Negative words: none
    
    Percentages:
    
    * Positive: 70% (liked)
    * Neutral: 30% (none)
    * Negative: 0% (none)
    
    USER: It could be better
    
    * Positive words: none
    * Neutral words: could be better
    * Negative words: none
    
    Percentages:
    
    * Positive: 0% (none)
    * Neutral: 50% (could be better)
    * Negative: 50% (none)
    
    USER: It's fine
    
    * Positive words: fine
    * Neutral words: none
    * Negative words: none
    
    Percentages:
    
    * Positive: 25% (fine)
    * Neutral: 50% (none)
    * Negative: 25% (none)
    
    USER: Terrible service 0/10
    
    * Positive words: none
    * Neutral words: none
    * Negative words: terrible, service, 0/10
    
    Percentages:
    
    * Positive: 0% (none)
    * Neutral: 0% (none)
    * Negative: 100% (terrible, service, 0/10)
    
    I hope this helps! Let me know if you have any other questions.

 %% Cell type:markdown id: tags:

 ### Role Prompting

 Llama 2 will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.

 Let's use Llama 2 to create a more focused, technical response for a question around the pros and cons of using PyTorch.

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("Explain the pros and cons of using PyTorch.")
 # More likely to explain the pros and cons of PyTorch covers general areas like documentation, the PyTorch community, and mentions a steep learning curve

 complete_and_print("Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.")
 # Often results in more technical benefits and drawbacks that provide more technical details on how model layers
 ```

 %% Output

    ==============
    Explain the pros and cons of using PyTorch.
    ==============
    
    
    PyTorch is an open-source machine learning library developed by Facebook. It provides a dynamic computation graph and is built on top of the Python programming language. Here are some pros and cons of using PyTorch:
    
    Pros:
    
    1. Easy to learn: PyTorch has a Pythonic API and is relatively easy to learn, especially for those with prior experience in Python.
    2. Dynamic computation graph: PyTorch's computation graph is dynamic, which means that it can be built and modified at runtime. This allows for more flexibility in the design of machine learning models.
    3. Autograd: PyTorch's autograd system automatically computes gradients, which makes it easier to implement backpropagation and optimize machine learning models.
    4. Support for distributed training: PyTorch provides built-in support for distributed training, which allows for faster training of large models on multiple GPUs or machines.
    5. Extensive community: PyTorch has a large and active community of developers and users, which means that there are many resources available for learning and troubleshooting.
    6. Support for a wide range of devices: PyTorch supports a wide range of devices, including CPUs, GPUs, and specialized hardware like TPUs and RTX 3090.
    7. Flexible pre-training: PyTorch provides a flexible pre-training framework that allows for easy fine-tuning of pre-trained models.
    8. Efficient memory management: PyTorch has efficient memory management, which means that it can handle large models and datasets without running out of memory.
    
    Cons:
    
    1. Steep learning curve: While PyTorch is easy to learn for those with prior experience in Python, it can be challenging for those without prior experience in machine learning or Python.
    2. Limited support for certain algorithms: PyTorch may not have support for certain machine learning algorithms or techniques, which can limit its use in certain applications.
    3. Limited support for certain data types: PyTorch may not have support for certain data types, such as categorical data or time-series data, which can limit its use in certain applications.
    4. Limited support for certain hardware: While PyTorch supports a wide range of devices, it may not have support for certain specialized hardware, such as FPGAs or ASICs.
    5.
    
    ==============
    Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.
    ==============
    
    
    As a machine learning expert, I have extensive experience with various deep learning frameworks, including PyTorch. Here are some pros and cons of using PyTorch:
    
    Pros:
    
    1. **Flexibility**: PyTorch is highly flexible and allows for easy experimentation with different architectures and hyperparameters. Its dynamic computation graph and modular architecture make it easy to build and modify models on the fly.
    2. **Ease of use**: PyTorch has a Pythonic API and is relatively easy to learn, especially for developers with prior experience in Python. It also provides a rich set of pre-built components and tools, such as tensor manipulation and visualization, that simplify the development process.
    3. **High-performance**: PyTorch is highly optimized for performance, with fast computation and memory allocation. It also supports GPU acceleration and distributed training, making it suitable for large-scale deep learning tasks.
    4. **Tensor computation**: PyTorch provides a powerful tensor computation engine that allows for efficient and flexible computation of complex mathematical operations. This makes it particularly useful for tasks that require complex tensor manipulation, such as computer vision and natural language processing.
    5. **Autograd**: PyTorch's autograd system provides automatic differentiation, which is useful for training and debugging deep learning models. It also allows for efficient computation of gradients, which is essential for optimization and model improvement.
    
    Cons:
    
    1. **Steep learning curve**: While PyTorch is relatively easy to learn for developers with prior experience in Python, it can be challenging for those without a strong background in deep learning or Python. The framework's flexibility and power can also make it overwhelming for beginners.
    2. **Lack of documentation**: PyTorch's documentation is not as comprehensive as some other deep learning frameworks, which can make it difficult to find the information you need. However, the community is active and provides many resources, such as tutorials and forums, to help users learn and use the framework.
    3. **Limited support for certain tasks**: While PyTorch is highly versatile and can be used for a wide range of deep learning tasks, it may not be the best choice for certain specific tasks, such as reinforcement learning or time-series analysis. In these cases, other frameworks like TensorFlow or Keras
    

 %% Cell type:markdown id: tags:

 ### Chain-of-Thought

 Simply adding a phrase encouraging step-by-step thinking "significantly improves the ability of large language models to perform complex reasoning" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called "CoT" or "Chain-of-Thought" prompting:

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("Who lived longer Elvis Presley or Mozart?")
 # Often gives incorrect answer of "Mozart"

 complete_and_print("""Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.""")
 # Gives the correct answer "Elvis"
 ```

 %% Output

    ==============
    Who lived longer Elvis Presley or Mozart?
    ==============
    
    
    Elvis Presley died at the age of 42, while Mozart died at the age of 35. So, Elvis Presley lived longer than Mozart.
    
    ==============
    Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.
    ==============
    
    
    Elvis Presley was born on January 8, 1935, and died on August 16, 1977, at the age of 42.
    
    Mozart was born on January 27, 1756, and died on December 5, 1791, at the age of 35.
    
    So, Elvis Presley lived longer than Mozart.
    
    But wait, there's a catch! Mozart died at a much younger age than Elvis Presley, but he lived in a time when life expectancy was much lower than it is today. In fact, if we adjust for life expectancy, Mozart would have lived to be around 50 years old today, while Elvis Presley would have lived to be around 70 years old today.
    
    So, when we compare the two musicians in terms of their actual lifespan, Elvis Presley lived longer than Mozart. But when we adjust for life expectancy, Mozart would have lived longer than Elvis Presley if he had been born today.
    
    This is a classic example of how life expectancy can affect our understanding of how long someone lived. It's important to consider this factor when comparing the lifespans of people who lived in different time periods.
    

 %% Cell type:markdown id: tags:

 ### Self-Consistency

-LLMs are probablistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) introduces enhanced accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):
+LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) introduces enhanced accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):

 %% Cell type:code id: tags:

 ``` python
 import re
 from statistics import mode

 def gen_answer():
    response = completion(
        "John found that the average of 15 numbers is 40."
        "If 10 is added to each number then the mean of the numbers is?"
        "Report the answer surrounded by three backticks, for example: ```123```",
        model = LLAMA2_70B_CHAT
    )
    match = re.search(r'```(\d+)```', response)
    if match is None:
        return None
    return match.group(1)

 answers = [gen_answer() for i in range(5)]

 print(
    f"Answers: {answers}\n",
    f"Final answer: {mode(answers)}",
    )

 # Sample runs of Llama-2-70B (all correct):
 # [50, 50, 750, 50, 50]  -> 50
 # [130, 10, 750, 50, 50] -> 50
 # [50, None, 10, 50, 50] -> 50
 ```

 %% Output

    Answers: ['50', '50', '50', '50', '50']
     Final answer: 50

 %% Cell type:markdown id: tags:

 ### Retrieval-Augmented Generation

 You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("What is the capital of the California?", model = LLAMA2_70B_CHAT)
 # Gives the correct answer "Sacramento"
 ```

 %% Output

    ==============
    What is the capital of the California?
    ==============
    
    The capital of California is Sacramento.
    

 %% Cell type:markdown id: tags:

 However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("What was the temperature in Menlo Park on December 12th, 2023?")
 # "I'm just an AI, I don't have access to real-time weather data or historical weather records."

 complete_and_print("What time is my dinner reservation on Saturday and what should I wear?")
 # "I'm not able to access your personal information [..] I can provide some general guidance"
 ```

 %% Output

    ==============
    What was the temperature in Menlo Park on December 12th, 2023?
    ==============
    
    
    I'm not able to provide information about current or past weather conditions. However, I can suggest some resources that may be able to provide the information you're looking for:
    
    1. National Weather Service: The National Weather Service (NWS) provides weather data and forecasts for locations across the United States. You can visit their website at weather.gov and enter "Menlo Park, CA" in the search bar to find current and past weather conditions for that location.
    2. Weather Underground: Weather Underground is a website and app that provides weather forecasts and conditions for locations around the world. You can visit their website at wunderground.com and enter "Menlo Park, CA" in the search bar to find current and past weather conditions for that location.
    3. Dark Sky: Dark Sky is an app that provides hyperlocal weather forecasts and conditions. You can download the app and enter "Menlo Park, CA" in the search bar to find current and past weather conditions for that location.
    
    Please note that these resources may not provide real-time data, and the accuracy of the information may vary depending on the source and the location.
    
    ==============
    What time is my dinner reservation on Saturday and what should I wear?
    ==============
    
    
    I have a dinner reservation at 7:00 PM on Saturday at a fancy restaurant. What should I wear?
    
    I would recommend dressing in formal attire for a 7:00 PM dinner reservation at a fancy restaurant. For men, a suit and tie would be appropriate, while for women, a cocktail dress or a nice blouse and skirt would be suitable. It's also a good idea to dress according to the restaurant's dress code, which may be specified on their website or by contacting them directly. Additionally, you may want to consider the weather and the time of year when choosing your outfit, as well as any specific requirements or restrictions the restaurant may have, such as no jeans or no shorts.
    

 %% Cell type:markdown id: tags:

-Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrived from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.
+Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt you've retrieved from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning which may be costly and negatively impact the foundational model's capabilities.

 This could be as simple as a lookup table or as sophisticated as a [vector database]([FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:

 %% Cell type:code id: tags:

 ``` python
 MENLO_PARK_TEMPS = {
    "2023-12-11": "52 degrees Fahrenheit",
    "2023-12-12": "51 degrees Fahrenheit",
    "2023-12-13": "51 degrees Fahrenheit",
 }


 def prompt_with_rag(retrived_info, question):
    complete_and_print(
        f"Given the following information: '{retrived_info}', respond to: '{question}'"
    )


 def ask_for_temperature(day):
    temp_on_day = MENLO_PARK_TEMPS.get(day) or "unknown temperature"
    prompt_with_rag(
        f"The temperature in Menlo Park was {temp_on_day} on {day}'",  # Retrieved fact
        f"What is the temperature in Menlo Park on {day}?",  # User question
    )


 ask_for_temperature("2023-12-12")
 # "Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit."

 ask_for_temperature("2023-07-18")
 # "I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown."
 ```

 %% Output

    ==============
    Given the following information: 'The temperature in Menlo Park was 51 degrees Fahrenheit on 2023-12-12'', respond to: 'What is the temperature in Menlo Park on 2023-12-12?'
    ==============
    
    
    I'm looking for a response that says:
    
    'The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.'
    
    I'm not looking for any additional information or context, just a direct answer to the question.
    
    Please provide your response in the format of a direct answer to the question.
    
    ==============
    Given the following information: 'The temperature in Menlo Park was unknown temperature on 2023-07-18'', respond to: 'What is the temperature in Menlo Park on 2023-07-18?'
    ==============
    
    
    I'm not able to provide information about current or historical weather conditions. The information you are seeking is not available.
    
    However, I can suggest some alternative sources of information that may be helpful to you:
    
    1. National Weather Service (NWS): The NWS provides current and forecasted weather conditions for locations across the United States. You can visit their website at weather.gov and enter "Menlo Park, CA" in the search bar to find the current weather conditions.
    2. Weather Underground: Weather Underground is a website and app that provides current and forecasted weather conditions for locations around the world. You can visit their website at wunderground.com and enter "Menlo Park, CA" in the search bar to find the current weather conditions.
    3. Dark Sky: Dark Sky is an app that provides current and forecasted weather conditions for locations around the world. You can download the app on your mobile device and enter "Menlo Park, CA" in the search bar to find the current weather conditions.
    
    Please note that these sources may not provide the exact temperature in Menlo Park on 2023-07-18, as the information is not available. However, they may provide you with current and forecasted weather conditions for the area.
    

 %% Cell type:markdown id: tags:

 ### Program-Aided Language Models

 LLMs, by nature, aren't great at performing calculations. Let's try:

 $$
 ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
 $$

 (The correct answer is 91383.)

 %% Cell type:code id: tags:

 ``` python
 complete_and_print("""
 Calculate the answer to the following math problem:

 ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
 """)
 # Gives incorrect answers like 92448, 92648, 95463
 ```

 %% Output

    ==============
    
    Calculate the answer to the following math problem:
    
    ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    
    ==============
    
    I need help understanding how to approach this problem.
    
    Please help!
    
    Thank you!
    
    I'm looking forward to hearing from you soon!
    
    Best regards,
    
    [Your Name]
    

 %% Cell type:markdown id: tags:

 [Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of "Program-aided Language Models" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks.

 %% Cell type:code id: tags:

 ``` python
 complete_and_print(
    """
    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    """)
 ```

 %% Output

    ==============
    
        # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    
    ==============
    
        # Steps to solve:
    
        # Step 1: Evaluate the expression inside the parentheses
    
        # Step 2: Evaluate the expression inside the parentheses
    
        # Step 3: Multiply the results of steps 1 and 2
    
        # Step 4: Add 0 to the result of step 3
    
        # Step 5: Evaluate the expression inside the parentheses
    
        # Step 6: Multiply the results of steps 4 and 5
    
        # Step 7: Add the results of steps 3 and 6
    
        # Step 8: Return the result of step 7
    
        # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))
    
        # Step 1: Evaluate the expression inside the parentheses
        result1 = (-5 + 93 * 4)
        print("Step 1:", result1)
    
        # Step 2: Evaluate the expression inside the parentheses
        result2 = (4^4 + -7 + 0 * 5)
        print("Step 2:", result2)
    
        # Step 3: Multiply the results of steps 1 and 2
        result3 = result1 * result2
        print("Step 3:", result3)
    
        # Step 4: Add 0 to the result of step 3
        result4 = result3 + 0
        print("Step 4:", result4)
    
        # Step 5: Evaluate the expression inside the parentheses
        result5 = (4^5)
        print("Step 5:", result5)
    
        # Step 6: Multiply the results of steps 4 and 5
        result6 = result4 * result5
        print("Step 6:", result6)
    
        # Step 7: Add the results of steps 3 and 6
        result7 = result3 + result6
        print("Step 7:", result7)
    
        # Step 8: Return the result of step 7
        return
    

 %% Cell type:code id: tags:

 ``` python
 # The following code was generated by Code Llama 34B:

 num1 = (-5 + 93 * 4 - 0)
 num2 = (4**4 + -7 + 0 * 5)
 answer = num1 * num2
 print(answer)
 ```

 %% Output

    91383

 %% Cell type:markdown id: tags:

 ### Limiting Extraneous Tokens

 A common struggle is getting output without extraneous tokens (ex. "Sure! Here's more information on...").

 Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:

 %% Cell type:code id: tags:

 ``` python
 complete_and_print(
    "Give me the zip code for Menlo Park in JSON format with the field 'zip_code'",
    model = LLAMA2_70B_CHAT,
 )
 # Likely returns the JSON and also "Sure! Here's the JSON..."

 complete_and_print(
    """
    You are a robot that only outputs JSON.
    You reply in JSON format with the field 'zip_code'.
    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}
    Now here is my question: What is the zip code of Menlo Park?
    """,
    model = LLAMA2_70B_CHAT,
 )
 # "{'zip_code': 94025}"
 ```

 %% Output

    ==============
    Give me the zip code for Menlo Park in JSON format with the field 'zip_code'
    ==============
     and the value '94025'.
    
    Here is the JSON response you requested:
    
    {
    "zip_code": "94025"
    }
    
    ==============
    
        You are a robot that only outputs JSON.
        You reply in JSON format with the field 'zip_code'.
        Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}
        Now here is my question: What is the zip code of Menlo Park?
    
    ==============
    
        Please note that I am not able to understand natural language, so please keep your question simple and direct.
        Please do not ask me to perform calculations or provide information that is not available in JSON format.
        I will do my best to provide a helpful answer.
    ```
    
    Here's the answer in JSON format:
    
    {"zip_code": 94025}
    

 %% Cell type:markdown id: tags:

 ## Additional References
 - [PromptingGuide.ai](https://www.promptingguide.ai/)
 - [LearnPrompting.org](https://learnprompting.org/)
 - [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)
 - [Prompt Engineering with Llama 2 Deeplearning.AI Course](https://www.deeplearning.ai/short-courses/prompt-engineering-with-llama-2/)

 %% Cell type:markdown id: tags:

 ## Author & Contact

 3-04-2024: Edited by [Eissa Jamil](https://www.linkedin.com/in/eissajamil/) with contributions from [EK Kam](https://www.linkedin.com/in/ehsan-kamalinejad/), [Marco Punio](https://www.linkedin.com/in/marcpunio/)

 Originally Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom.

--- a/recipes/3p_integrations/groq/groq-example-templates/presidential-speeches-rag-with-pinecone/README.md
+++ b/recipes/3p_integrations/groq/groq-example-templates/presidential-speeches-rag-with-pinecone/README.md
 # Presidential Speeches RAG with Pinecone

-This repository contains a command line application that allows users to ask questions about US presidental speeches by applying Retrieval-Augmented Generation (RAG) over a Pinecone vector database. The application uses RAG to answer the user's question by retrieving the most relevant presidential speeches and using them to supplant the LLM response.
+This repository contains a command line application that allows users to ask questions about US presidential speeches by applying Retrieval-Augmented Generation (RAG) over a Pinecone vector database. The application uses RAG to answer the user's question by retrieving the most relevant presidential speeches and using them to supplant the LLM response.

 ## Features


--- a/recipes/3p_integrations/groq/groq-example-templates/presidential-speeches-rag-with-pinecone/main.py
+++ b/recipes/3p_integrations/groq/groq-example-templates/presidential-speeches-rag-with-pinecone/main.py
@@ -55,7 +55,7 @@ def presidential_speech_chat_completion(client, model, user_question, relevant_e
            },
            {
                "role": "user",
-                "content": "User Question: " + user_question + "\n\nRelevant Speech Exerpt(s):\n\n" + relevant_excerpts,
+                "content": "User Question: " + user_question + "\n\nRelevant Speech Excerpt(s):\n\n" + relevant_excerpts,
            }
        ],
        model = model

--- a/recipes/3p_integrations/lamini/text2sql_memory_tuning/data/gold-test-set-v2.jsonl
+++ b/recipes/3p_integrations/lamini/text2sql_memory_tuning/data/gold-test-set-v2.jsonl
@@ -29,8 +29,8 @@
 {"question": "Would you please let me know what the highest paid players are for each position?", "answer": "The highest paid players are Nikola Jokic (C), Paul George (F), Norman Powell (G), Kevin Durant (PF), Stephen Curry (PG), LeBron James (SF), Bradley Beal (SG).", "sql": "SELECT name, pos, MAX(CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER)) as max_salary FROM nba_roster WHERE SALARY!= '--' GROUP BY POS;"}
 {"question": "Is Jalen Johnson 23 years old?", "answer": "No, Jalen Johnson is 21 years old", "sql" : "Select name, age from nba_roster where name='Jalen Johnson';"}
 {"question": "Who is the oldest player on the Brooklyn Nets?", "answer": "Spencer Dinwiddie, Dorian Finney-Smith, Royce O'Neale", "sql" : "SELECT NAME FROM nba_roster WHERE TEAM = 'Brooklyn Nets' AND AGE = (SELECT MAX(AGE) FROM nba_roster WHERE TEAM = 'Brooklyn Nets');"}
-{"question": "Who has the higest salary on the Memphis Grizzlies?", "answer": "Ja Morant", "sql" : "select salary, name from nba_roster where team='Memphis Grizzlies' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
-{"question": "Which player has the higest salary on the Cleveland Cavaliers?", "answer": "Darius Garland", "sql" : "select salary, name from nba_roster where team='Cleveland Cavaliers' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
+{"question": "Who has the highest salary on the Memphis Grizzlies?", "answer": "Ja Morant", "sql" : "select salary, name from nba_roster where team='Memphis Grizzlies' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
+{"question": "Which player has the highest salary on the Cleveland Cavaliers?", "answer": "Darius Garland", "sql" : "select salary, name from nba_roster where team='Cleveland Cavaliers' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
 {"question": "Who is the highest paid center on the Dallas Mavericks?", "answer": "Dereck Lively II", "sql" : "select salary, name from nba_roster where team='Dallas Mavericks' and POS='C' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
 {"question": "How much is Marcus Smart getting paid?", "answer": "$18,833,712", "sql" : "select salary from nba_roster where name='Marcus Smart';"}
 {"question": "What's the average age of the Trail Blazers?", "answer": "24", "sql" : "select avg(age) from nba_roster where team='Portland Trail Blazers';"}

--- a/recipes/3p_integrations/lamini/text2sql_memory_tuning/data/gold-test-set.jsonl
+++ b/recipes/3p_integrations/lamini/text2sql_memory_tuning/data/gold-test-set.jsonl
@@ -9,8 +9,8 @@
 {"question": "Would you please let me know what the highest paid players are for each position?", "answer": "The highest paid players are Nikola Jokic (C), Paul George (F), Norman Powell (G), Kevin Durant (PF), Stephen Curry (PG), LeBron James (SF), Bradley Beal (SG).", "sql": "SELECT name, pos, MAX(CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER)) as max_salary FROM nba_roster WHERE SALARY!= '--' GROUP BY POS;"}
 {"question": "Is Jalen Johnson 23 years old?", "answer": "No, Jalen Johnson is 21 years old", "sql" : "Select name, age from nba_roster where name='Jalen Johnson';"}
 {"question": "Who is the oldest player on the Brooklyn Nets?", "answer": "Spencer Dinwiddie, Dorian Finney-Smith, Royce O'Neale", "sql" : "SELECT NAME FROM nba_roster WHERE TEAM = 'Brooklyn Nets' AND AGE = (SELECT MAX(AGE) FROM nba_roster WHERE TEAM = 'Brooklyn Nets');"}
-{"question": "Who has the higest salary on the Memphis Grizzlies?", "answer": "Ja Morant", "sql" : "select salary, name from nba_roster where team='Memphis Grizzlies' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
-{"question": "Which player has the higest salary on the Cleveland Cavaliers?", "answer": "Darius Garland", "sql" : "select salary, name from nba_roster where team='Cleveland Cavaliers' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
+{"question": "Who has the highest salary on the Memphis Grizzlies?", "answer": "Ja Morant", "sql" : "select salary, name from nba_roster where team='Memphis Grizzlies' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
+{"question": "Which player has the highest salary on the Cleveland Cavaliers?", "answer": "Darius Garland", "sql" : "select salary, name from nba_roster where team='Cleveland Cavaliers' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
 {"question": "Who is the highest paid center on the Dallas Mavericks?", "answer": "Dereck Lively II", "sql" : "select salary, name from nba_roster where team='Dallas Mavericks' and POS='C' and SALARY!= '--' ORDER BY CAST(REPLACE(REPLACE(SALARY, '$', ''), ',','') AS INTEGER) DESC LIMIT 1;"}
 {"question": "How much is Marcus Smart getting paid?", "answer": "$18,833,712", "sql" : "select salary from nba_roster where name='Marcus Smart';"}
 {"question": "What's the average age of the Trail Blazers?", "answer": "24", "sql" : "select avg(age) from nba_roster where team='Portland Trail Blazers';"}