diff --git a/README.md b/README.md index 0870e7aa397f9341b55494c958a649b152e18aad..64f0cb02ce5da4d78d178066ffc4d2614e8b8952 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ # Llama 2 Fine-tuning / Inference Recipes, Examples, Benchmarks and Demo Apps +**[Update Feb. 26, 2024] We added examples to showcase OctoAI's cloud APIs for Llama2, CodeLlama, and LlamaGuard: including [PurpleLlama overview](./examples/Purple_Llama_OctoAI.ipynb), [hello Llama2 cloud](./demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb), [getting to know Llama2](./demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb), [live search example](./demo_apps/OctoAI_API_examples/LiveData.ipynb), [Llama2 Gradio demo](./demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb), [Youtube video summarization](./demo_apps/OctoAI_API_examples/VideoSummary.ipynb), and [retrieval augmented generation overview](./demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)**. + **[Update Feb. 5, 2024] We added support for Code Llama 70B instruct in our example [inference script](./examples/code_llama/code_instruct_example.py). For details on formatting the prompt for Code Llama 70B instruct model please refer to [this document](./docs/inference.md)**. **[Update Dec. 28, 2023] We added support for Llama Guard as a safety checker for our example inference script and also with standalone inference with an example script and prompt formatting. More details [here](./examples/llama_guard/README.md). 
For details on formatting data for fine tuning Llama Guard, we provide a script and sample usage [here](./src/llama_recipes/data/llama_guard/README.md).** diff --git a/demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb b/demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..96396709f9b67b56992ac83f1d1939acb498c7ee --- /dev/null +++ b/demo_apps/OctoAI_API_examples/Getting_to_know_Llama.ipynb @@ -0,0 +1,1030 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "LERqQn5v8-ak" + }, + "source": [ + "# **Getting to know Llama 2: Everything you need to start building**\n", + "Our goal in this session is to provide a guided tour of Llama 2, including understanding different Llama 2 models, how and where to access them, Generative AI and Chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), Fine-tuning and more. All this is implemented with a starter code for you to take it and use it in your Llama 2 projects." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ioVMNcTesSEk" + }, + "source": [ + "##**0 - Prerequisites**\n", + "* Basic understanding of Large Language Models\n", + "\n", + "* Basic understanding of Python" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 248, + "status": "ok", + "timestamp": 1695832228254, + "user": { + "displayName": "Amit Sangani", + "userId": "11552178012079240149" + }, + "user_tz": 420 + }, + "id": "ktEA7qXmwdUM" + }, + "outputs": [], + "source": [ + "# presentation layer code\n", + "\n", + "import base64\n", + "from IPython.display import Image, display\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def mm(graph):\n", + " graphbytes = graph.encode(\"ascii\")\n", + " base64_bytes = base64.b64encode(graphbytes)\n", + " base64_string = base64_bytes.decode(\"ascii\")\n", + " display(Image(url=\"https://mermaid.ink/img/\" + base64_string))\n", + "\n", + "def genai_app_arch():\n", + " mm(\"\"\"\n", + " flowchart TD\n", + " A[Users] --> B(Applications e.g. mobile, web)\n", + " B --> |Hosted API|C(Platforms e.g. Custom, OctoAI, HuggingFace, Replicate)\n", + " B -- optional --> E(Frameworks e.g. LangChain)\n", + " C-->|User Input|D[Llama 2]\n", + " D-->|Model Output|C\n", + " E --> C\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "def rag_arch():\n", + " mm(\"\"\"\n", + " flowchart TD\n", + " A[User Prompts] --> B(Frameworks e.g. 
LangChain)\n", + " B <--> |Database, Docs, XLS|C[fa:fa-database External Data]\n", + " B -->|API|D[Llama 2]\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "def llama2_family():\n", + " mm(\"\"\"\n", + " graph LR;\n", + " llama-2 --> llama-2-7b\n", + " llama-2 --> llama-2-13b\n", + " llama-2 --> llama-2-70b\n", + " llama-2-7b --> llama-2-7b-chat\n", + " llama-2-13b --> llama-2-13b-chat\n", + " llama-2-70b --> llama-2-70b-chat\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "def apps_and_llms():\n", + " mm(\"\"\"\n", + " graph LR;\n", + " users --> apps\n", + " apps --> frameworks\n", + " frameworks --> platforms\n", + " platforms --> Llama 2\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "import ipywidgets as widgets\n", + "from IPython.display import display, Markdown\n", + "\n", + "# Create a text widget\n", + "API_KEY = widgets.Password(\n", + " value='',\n", + " placeholder='',\n", + " description='API_KEY:',\n", + " disabled=False\n", + ")\n", + "\n", + "def md(t):\n", + " display(Markdown(t))\n", + "\n", + "def bot_arch():\n", + " mm(\"\"\"\n", + " graph LR;\n", + " user --> prompt\n", + " prompt --> i_safety\n", + " i_safety --> context\n", + " context --> Llama_2\n", + " Llama_2 --> output\n", + " output --> o_safety\n", + " i_safety --> memory\n", + " o_safety --> memory\n", + " memory --> context\n", + " o_safety --> user\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "def fine_tuned_arch():\n", + " mm(\"\"\"\n", + " graph LR;\n", + " Custom_Dataset --> Pre-trained_Llama\n", + " Pre-trained_Llama --> Fine-tuned_Llama\n", + " Fine-tuned_Llama --> RLHF\n", + " RLHF --> |Loss:Cross-Entropy|Fine-tuned_Llama\n", + " classDef default 
fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "def load_data_faiss_arch():\n", + " mm(\"\"\"\n", + " graph LR;\n", + " documents --> textsplitter\n", + " textsplitter --> embeddings\n", + " embeddings --> vectorstore\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n", + "\n", + "def mem_context():\n", + " mm(\"\"\"\n", + " graph LR\n", + " context(text)\n", + " user_prompt --> context\n", + " instruction --> context\n", + " examples --> context\n", + " memory --> context\n", + " context --> tokenizer\n", + " tokenizer --> embeddings\n", + " embeddings --> LLM\n", + " classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n", + " \"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i4Np_l_KtIno" + }, + "source": [ + "##**1 - Understanding Llama 2**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PGPSI3M5PGTi" + }, + "source": [ + "### **1.1 - What is Llama 2?**\n", + "\n", + "* State of the art (SOTA), Open Source LLM\n", + "* 7B, 13B, 70B\n", + "* Pretrained + Chat\n", + "* Choosing model: Size, Quality, Cost, Speed\n", + "* [Research paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n", + "\n", + "* [Responsible use guide](https://ai.meta.com/llama/responsible-use-guide/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 240 + }, + "executionInfo": { + "elapsed": 248, + "status": "ok", + "timestamp": 1695832233087, + "user": { + "displayName": "Amit Sangani", + "userId": "11552178012079240149" + }, + "user_tz": 420 + }, + "id": "OXRCC7wexZXd", + "outputId": "1feb1918-df4b-4cec-d09e-ffe55c12090b" + }, + "outputs": [], + "source": [ + "llama2_family()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYeHVVh45bdT" + 
}, + "source": [ + "### **1.2 - Accessing Llama 2**\n", + "* Download + Self Host (on-premise)\n", + "* Hosted API Platform (e.g. [OctoAI](https://octoai.cloud/), [Replicate](https://replicate.com/meta))\n", + "* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kBuSay8vtzL4" + }, + "source": [ + "### **1.3 - Use Cases of Llama 2**\n", + "* Content Generation\n", + "* Chatbots\n", + "* Summarization\n", + "* Programming (e.g. Code Llama)\n", + "\n", + "* and many more..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sd54g0OHuqBY" + }, + "source": [ + "## **2 - Using Llama 2**\n", + "\n", + "In this notebook, we are going to access the [Llama 2 13b chat model](https://octoai.cloud/tools/text/chat?mode=demo&model=llama-2-13b-chat-fp16) using the hosted API from OctoAI." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h3YGMDJidHtH" + }, + "source": [ + "### **2.1 - Install dependencies**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VhN6hXwx7FCp" + }, + "outputs": [], + "source": [ + "# Install dependencies and initialize\n", + "%pip install -qU \\\n", + " octoai-sdk \\\n", + " langchain \\\n", + " sentence_transformers \\\n", + " pdf2image \\\n", + " pdfminer \\\n", + " pdfminer.six \\\n", + " unstructured \\\n", + " faiss-cpu \\\n", + " pillow-heif \\\n", + " opencv-python \\\n", + " unstructured-inference \\\n", + " pikepdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Z8Y8qjEjmg50" + }, + "outputs": [], + "source": [ + "# model on the OctoAI platform that we will use for inference\n", + "# We will use the Llama 2 13b chat model hosted on OctoAI\n", + "\n", + "llama2_13b = \"llama-2-13b-chat-fp16\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8hkWpqWD28ho" + }, + "outputs": [], + "source": [ + "# We will use OctoAI hosted cloud environment\n", + "# Obtain OctoAI API key → https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token\n", + "\n", + "# enter your OctoAI API token\n", + "from getpass import getpass\n", + "import os\n", + "\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN\n", + "\n", + "# alternatively, you can store the token in an environment variable and load it here" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bVCHZmETk36v" + }, + "outputs": [], + "source": [ + "# we will use OctoAI's hosted API\n", + "from octoai.client import Client\n", + "\n", + "client = Client(OCTOAI_API_TOKEN)\n", + "\n", + "# text completion with input prompt\n", + "def Completion(prompt):\n", + " output = client.chat.completions.create(\n", + " messages=[\n", + " {\n", + " \"role\": 
\"user\",\n", + " \"content\": prompt\n", + " }\n", + " ],\n", + " model=\"llama-2-13b-chat-fp16\",\n", + " max_tokens=1000\n", + " )\n", + " return output.choices[0].message.content\n", + "\n", + "# chat completion with input prompt and system prompt\n", + "def ChatCompletion(prompt, system_prompt=None):\n", + " output = client.chat.completions.create(\n", + " messages=[\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": system_prompt\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": prompt\n", + " }\n", + " ],\n", + " model=\"llama-2-13b-chat-fp16\",\n", + " max_tokens=1000\n", + " )\n", + " return output.choices[0].message.content" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5Jxq0pmf6L73" + }, + "source": [ + "### **2.2 - Basic completion**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H93zZBIk6tNU" + }, + "outputs": [], + "source": [ + "output = Completion(prompt=\"The typical color of a llama is: \")\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "StccjUDh6W0Q" + }, + "source": [ + "### **2.3 - System prompts**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VRnFogxd6rTc" + }, + "outputs": [], + "source": [ + "output = ChatCompletion(\n", + " prompt=\"The typical color of a llama is: \",\n", + " system_prompt=\"respond with only one word\"\n", + " )\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hp4GNa066pYy" + }, + "source": [ + "### **2.4 - Response formats**\n", + "* Can support different formatted outputs e.g. text, JSON, etc." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HTN79h4RptgQ" + }, + "outputs": [], + "source": [ + "output = ChatCompletion(\n", + " prompt=\"The typical color of a llama is: \",\n", + " system_prompt=\"response in json format\"\n", + " )\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cWs_s9y-avIT" + }, + "source": [ + "## **3 - Gen AI Application Architecture**\n", + "\n", + "Here is the high-level tech stack/architecture of Generative AI application." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 446 + }, + "executionInfo": { + "elapsed": 405, + "status": "ok", + "timestamp": 1695832253437, + "user": { + "displayName": "Amit Sangani", + "userId": "11552178012079240149" + }, + "user_tz": 420 + }, + "id": "j9BGuI-9AOL5", + "outputId": "72b2613f-a434-4219-f063-52a409af97cc" + }, + "outputs": [], + "source": [ + "genai_app_arch()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6UlxBtbgys6j" + }, + "source": [ + "##4 - **Chatbot Architecture**\n", + "\n", + "Here are the key components and the information flow in a chatbot.\n", + "\n", + "* User Prompts\n", + "* Input Safety\n", + "* Llama 2\n", + "* Output Safety\n", + "\n", + "* Memory & Context" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 178 + }, + "executionInfo": { + "elapsed": 249, + "status": "ok", + "timestamp": 1695832257063, + "user": { + "displayName": "Amit Sangani", + "userId": "11552178012079240149" + }, + "user_tz": 420 + }, + "id": "tO5HnB56ys6t", + "outputId": "f222d35b-626f-4dc1-b7af-a156a0f3d58b" + }, + "outputs": [], + "source": [ + "bot_arch()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r4DyTLD5ys6t" + }, + "source": [ + "### **4.1 - Chat conversation**\n", + "* LLMs are stateless\n", 
+ "* Single Turn\n", + "\n", + "* Multi Turn (Memory)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EMM_egWMys6u" + }, + "outputs": [], + "source": [ + "# example of single turn chat\n", + "prompt_chat = \"What is the average lifespan of a Llama?\"\n", + "output = ChatCompletion(prompt=prompt_chat, system_prompt=\"answer the last question in few words\")\n", + "md(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sZ7uVKDYucgi" + }, + "outputs": [], + "source": [ + "# example without previous context. LLMs are stateless and cannot resolve \"they\" without the previous context\n", + "prompt_chat = \"What animal family are they?\"\n", + "output = ChatCompletion(prompt=prompt_chat, system_prompt=\"answer the last question in few words\")\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WQl3wmfbyBQ1" + }, + "source": [ + "A chat app requires us to send the previous context to the LLM to get valid responses. Below is an example of multi-turn chat." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "t7SZe5fT3HG3" + }, + "outputs": [], + "source": [ + "# example of multi-turn chat, storing the previous context\n", + "prompt_chat = \"\"\"\n", + "User: What is the average lifespan of a Llama?\n", + "Assistant: Sure! 
The average lifespan of a llama is around 20-30 years.\n", + "User: What animal family are they?\n", + "\"\"\"\n", + "output = ChatCompletion(prompt=prompt_chat, system_prompt=\"answer the last question\")\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "moXnmJ_xyD10" + }, + "source": [ + "### **4.2 - Prompt Engineering**\n", + "* Prompt engineering refers to the science of designing effective prompts to get desired responses\n", + "\n", + "* Helps reduce hallucination\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t-v-FeZ4ztTB" + }, + "source": [ + "#### **4.2.1 - In-Context Learning (e.g. Zero-shot, Few-shot)**\n", + " * In-context learning - a specific method of prompt engineering where demonstrations of the task are provided as part of the prompt.\n", + " 1. Zero-shot learning - the model performs tasks without any\n", + "input examples.\n", + " 2. Few or “N-Shot” learning - the model performs and behaves based on input examples in the user's prompt." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6W71MFNZyRkQ" + }, + "outputs": [], + "source": [ + "# Zero-shot example. 
To get positive/negative/neutral sentiment, we need to give examples in the prompt\n", + "prompt = '''\n", + "Classify: I saw a Gecko.\n", + "Sentiment: ?\n", + "'''\n", + "output = ChatCompletion(prompt, system_prompt=\"one word response\")\n", + "md(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MCQRjf1Y1RYJ" + }, + "outputs": [], + "source": [ + "# Given examples in the prompt, Llama understands the expected output format.\n", + "\n", + "prompt = '''\n", + "Classify: I love Llamas!\n", + "Sentiment: Positive\n", + "Classify: I don't like Snakes.\n", + "Sentiment: Negative\n", + "Classify: I saw a Gecko.\n", + "Sentiment:'''\n", + "\n", + "output = ChatCompletion(prompt, system_prompt=\"One word response\")\n", + "md(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8UmdlTmpDZxA" + }, + "outputs": [], + "source": [ + "# Another zero-shot learning example\n", + "prompt = '''\n", + "QUESTION: Vicuna?\n", + "ANSWER:'''\n", + "\n", + "output = ChatCompletion(prompt, system_prompt=\"one word response\")\n", + "md(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M_EcsUo1zqFD" + }, + "outputs": [], + "source": [ + "# Another few-shot learning example with formatted prompt.\n", + "\n", + "prompt = '''\n", + "QUESTION: Llama?\n", + "ANSWER: Yes\n", + "QUESTION: Alpaca?\n", + "ANSWER: Yes\n", + "QUESTION: Rabbit?\n", + "ANSWER: No\n", + "QUESTION: Vicuna?\n", + "ANSWER:'''\n", + "\n", + "output = ChatCompletion(prompt, system_prompt=\"one word response\")\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mbr124Y197xl" + }, + "source": [ + "#### **4.2.2 - Chain of Thought**\n", + "\"Chain of thought\" prompting enables complex reasoning through logical step-by-step thinking and generates meaningful and contextually relevant responses." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Xn8zmLBQzpgj" + }, + "outputs": [], + "source": [ + "# Standard prompting\n", + "prompt = '''\n", + "Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does Llama have now?\n", + "'''\n", + "\n", + "output = ChatCompletion(prompt, system_prompt=\"provide short answer\")\n", + "md(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lKNOj79o1Kwu" + }, + "outputs": [], + "source": [ + "# Chain-Of-Thought prompting\n", + "prompt = '''\n", + "Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does Llama have now?\n", + "Let's think step by step.\n", + "'''\n", + "\n", + "output = ChatCompletion(prompt, system_prompt=\"provide short answer\")\n", + "md(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C7tDW-AH770Y" + }, + "source": [ + "### **4.3 - Retrieval Augmented Generation (RAG)**\n", + "* Prompt Eng Limitations - Knowledge cutoff & lack of specialized data\n", + "\n", + "* Retrieval Augmented Generation(RAG) allows us to retrieve snippets of information from external data sources and augment it to the user's prompt to get tailored responses from Llama 2.\n", + "\n", + "For our demo, we are going to download an external PDF file from a URL and query against the content in the pdf file to get contextually relevant information back with the help of Llama!\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 259 + }, + "executionInfo": { + "elapsed": 329, + "status": "ok", + "timestamp": 1695832267093, + "user": { + "displayName": "Amit Sangani", + "userId": "11552178012079240149" + }, + "user_tz": 420 + }, + "id": "Fl1LPltpRQD9", + "outputId": 
"4410c9bf-3559-4a05-cebb-a5731bb094c1" + }, + "outputs": [], + "source": [ + "rag_arch()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JJaGMLl_4vYm" + }, + "source": [ + "#### **4.3.1 - LangChain**\n", + "LangChain is a framework that helps make it easier to implement RAG." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aoqU3KTcHTWN" + }, + "outputs": [], + "source": [ + "# langchain setup\n", + "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n", + "# Use the Llama 2 model hosted on OctoAI\n", + "# Temperature: Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value\n", + "# top_p: When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens\n", + "# max_new_tokens: Maximum number of tokens to generate. A word is generally 2-3 tokens\n", + "llama_model = OctoAIEndpoint(\n", + " endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n", + " model_kwargs={\n", + " \"model\": llama2_13b,\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful, respectful and honest assistant.\"\n", + " }\n", + " ],\n", + " \"max_tokens\": 1000,\n", + " \"top_p\": 1,\n", + " \"temperature\": 0.75\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gAV2EkZqcruF" + }, + "outputs": [], + "source": [ + "# Step 1: load the external data source. 
In our case, we will load Meta’s “Responsible Use Guide” pdf document.\n", + "from langchain.document_loaders import OnlinePDFLoader\n", + "loader = OnlinePDFLoader(\"https://ai.meta.com/static-resource/responsible-use-guide/\")\n", + "documents = loader.load()\n", + "\n", + "# Step 2: Get text splits from document\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)\n", + "all_splits = text_splitter.split_documents(documents)\n", + "\n", + "# Step 3: Use the embedding model\n", + "from langchain.vectorstores import FAISS\n", + "from langchain.embeddings import OctoAIEmbeddings\n", + "embeddings = OctoAIEmbeddings(endpoint_url=\"https://text.octoai.run/v1/embeddings\")\n", + "\n", + "# Step 4: Use vector store to store embeddings\n", + "vectorstore = FAISS.from_documents(all_splits, embeddings)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2l8S5tBxlkc" + }, + "source": [ + "#### **4.3.2 - LangChain Q&A Retriever**\n", + "* ConversationalRetrievalChain\n", + "\n", + "* Query the Source documents\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NmEhBe3Kiyre" + }, + "outputs": [], + "source": [ + "# Query against your own data\n", + "from langchain.chains import ConversationalRetrievalChain\n", + "chain = ConversationalRetrievalChain.from_llm(llama_model, vectorstore.as_retriever(), return_source_documents=True)\n", + "\n", + "chat_history = []\n", + "query = \"How is Meta approaching open science in two short sentences?\"\n", + "result = chain.invoke({\"question\": query, \"chat_history\": chat_history})\n", + "md(result['answer'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CelLHIvoy2Ke" + }, + "outputs": [], + "source": [ + "# This time your previous question and answer will be included as a chat history which will enable the ability\n", + "# to ask 
follow up questions.\n", + "chat_history = [(query, result[\"answer\"])]\n", + "query = \"How is it benefiting the world?\"\n", + "result = chain.invoke({\"question\": query, \"chat_history\": chat_history})\n", + "md(result['answer'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TEvefAWIJONx" + }, + "source": [ + "## **5 - Fine-Tuning Models**\n", + "\n", + "* Limitations of Prompt Eng and RAG\n", + "* Fine-Tuning Arch\n", + "* Types (PEFT, LoRA, QLoRA)\n", + "* Using PyTorch for Pre-Training & Fine-Tuning\n", + "\n", + "* Evals + Quality\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 79 + }, + "executionInfo": { + "elapsed": 327, + "status": "ok", + "timestamp": 1695832272878, + "user": { + "displayName": "Amit Sangani", + "userId": "11552178012079240149" + }, + "user_tz": 420 + }, + "id": "0a9CvJ8YcTzV", + "outputId": "56a6d573-a195-4e3c-834d-a3b23485186c" + }, + "outputs": [], + "source": [ + "fine_tuned_arch()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_8lcgdZa8onC" + }, + "source": [ + "## **6 - Responsible AI**\n", + "\n", + "* Power + Responsibility\n", + "* Hallucinations\n", + "* Input & Output Safety\n", + "* Red-teaming (simulating real-world cyber attackers)\n", + "* [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pbqb006R-T_k" + }, + "source": [ + "## **7 - Conclusion**\n", + "* Active research on LLMs and Llama\n", + "* Leverage the power of Llama and its open community\n", + "* Safety and responsible use are paramount!\n", + "\n", + "* Call-To-Action\n", + " * [Replicate Free Credits](https://replicate.fyi/connect2023) for Connect attendees!\n", + " * This notebook is available through Llama GitHub recipes\n", + " * Use Llama in your projects and give us feedback\n" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "gSz5dTMxp7xo" + }, + "source": [ + "#### **Resources**\n", + "- [GitHub - Llama 2](https://github.com/facebookresearch/llama)\n", + "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes)\n", + "- [Llama 2](https://ai.meta.com/llama/)\n", + "- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n", + "- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)\n", + "- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)\n", + "- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)\n", + "- [OctoAI](https://octoai.cloud/)\n", + "- [LangChain](https://www.langchain.com/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V7aI6fhZp-KC" + }, + "source": [ + "#### **Authors & Contact**\n", + " * asangani@meta.com, [Amit Sangani | LinkedIn](https://www.linkedin.com/in/amitsangani/)\n", + " * mohsena@meta.com, [Mohsen Agsen | LinkedIn](https://www.linkedin.com/in/mohsen-agsen-62a9791/)\n", + "\n", + "Adapted to run on OctoAI by Thierry Moreau - tmoreau@octo.ai" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [ + "ioVMNcTesSEk" + ], + "machine_shape": "hm", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb b/demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..449452327448e85fa5a5271ad5c55e73a7c17e19 --- /dev/null +++ 
b/demo_apps/OctoAI_API_examples/HelloLlamaCloud.ipynb @@ -0,0 +1,448 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1c1ea03a-cc69-45b0-80d3-664e48ca6831", + "metadata": {}, + "source": [ + "## This demo app shows:\n", + "* How to run Llama2 in the cloud hosted on OctoAI\n", + "* How to use LangChain to ask Llama general questions and follow up questions\n", + "* How to use LangChain to load a recent PDF doc - the Llama2 paper pdf - and chat about it. This is the well-known RAG (Retrieval Augmented Generation) method, which lets an LLM such as Llama2 answer questions about data that was not publicly available when Llama2 was trained, or about your own data. RAG is one way to reduce LLM hallucination\n", + "* You should also review the [HelloLlamaLocal](HelloLlamaLocal.ipynb) notebook for more information on RAG\n", + "\n", + "**Note:** We will be using OctoAI to run the examples here. You will need to first sign in to [OctoAI](https://octoai.cloud/) with your GitHub or Google account, then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n", + "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI." + ] + }, + { + "cell_type": "markdown", + "id": "61dde626", + "metadata": {}, + "source": [ + "Let's start by installing the necessary packages:\n", + "- sentence-transformers for text embeddings\n", + "- chromadb gives us database capabilities\n", + "- langchain provides necessary RAG tools for this demo\n", + "\n", + "And setting up the OctoAI token." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c608df5", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install langchain octoai-sdk sentence-transformers chromadb pypdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b9c5546a", + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "import os\n", + "\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN" + ] + }, + { + "cell_type": "markdown", + "id": "3e8870c1", + "metadata": {}, + "source": [ + "Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13b chat FP16 model. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n", + "\n", + "At the time of writing this notebook the following Llama models are available on OctoAI:\n", + "* llama-2-13b-chat\n", + "* llama-2-70b-chat\n", + "* codellama-7b-instruct\n", + "* codellama-13b-instruct\n", + "* codellama-34b-instruct\n", + "* codellama-70b-instruct" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad536adb", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n", + "\n", + "llama2_13b = \"llama-2-13b-chat-fp16\"\n", + "llm = OctoAIEndpoint(\n", + " endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n", + " model_kwargs={\n", + " \"model\": llama2_13b,\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful, respectful and honest assistant.\"\n", + " }\n", + " ],\n", + " \"max_tokens\": 500,\n", + " \"top_p\": 1,\n", + " \"temperature\": 0.01\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fd207c80", + "metadata": {}, + "source": [ + "With the model set up, you are now ready to ask some questions. Here is an example of the simplest way to ask the model some general questions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "493a7148", + "metadata": {}, + "outputs": [], + "source": [ + "question = \"who wrote the book Innovator's dilemma?\"\n", + "answer = llm(question)\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "id": "f315f000", + "metadata": {}, + "source": [ + "We will then try to follow up the response with a question asking for more information on the book. \n", + "\n", + "Since the chat history is not passed on, Llama doesn't have the context and doesn't know the follow up is about the book, so it treats it as a new query.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b5c8676", + "metadata": {}, + "outputs": [], + "source": [ + "# chat history not passed so Llama doesn't have the context and doesn't know this is more about the book\n", + "followup = \"tell me more\"\n", + "followup_answer = llm(followup)\n", + "print(followup_answer)" + ] + }, + { + "cell_type": "markdown", + "id": "9aeaffc7", + "metadata": {}, + "source": [ + "To get around this we will need to provide the model with the history of the chat. \n", + "\n", + "To do this, we will use [`ConversationBufferMemory`](https://python.langchain.com/docs/modules/memory/types/buffer) to pass the chat history to the model and give it the capability to handle follow up questions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5428ca27", + "metadata": {}, + "outputs": [], + "source": [ + "# using ConversationBufferMemory to pass memory (chat history) for follow up questions\n", + "from langchain.chains import ConversationChain\n", + "from langchain.memory import ConversationBufferMemory\n", + "\n", + "memory = ConversationBufferMemory()\n", + "conversation = ConversationChain(\n", + " llm=llm,\n", + " memory=memory,\n", + " verbose=False\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a3e9af5f", + "metadata": {}, + "source": [ + "Once this is set up, let us repeat the steps from before and ask the model a simple question.\n", + "\n", + "The chain's memory then passes the question and answer back to the model as context along with the follow up question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "baee2d22", + "metadata": {}, + "outputs": [], + "source": [ + "# restart from the original question\n", + "answer = conversation.predict(input=question)\n", + "print(answer)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c7d67a8", + "metadata": {}, + "outputs": [], + "source": [ + "# ConversationChain saves each question and answer to memory automatically,\n", + "# so the follow up \"tell me more\" is sent to Llama together with the earlier\n", + "# exchange about the book and no manual memory.save_context call is needed\n", + "followup_answer = conversation.predict(input=followup)\n", + "print(followup_answer)" + ] + }, + { + "cell_type": "markdown", + "id": "fc436163", + "metadata": {}, + "source": [ + "Next, let's explore using Llama 2 to answer questions using documents for context. \n", + "This gives us the ability to update Llama 2's knowledge thus giving it better context without needing to finetune. 
\n", + "For a more in-depth study of this, see the notebook on using Llama 2 locally [here](HelloLlamaLocal.ipynb)\n", + "\n", + "We will use the PyPDFLoader to load in a pdf, in this case, the Llama 2 paper." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5303d75", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import PyPDFLoader\n", + "loader = PyPDFLoader(\"https://arxiv.org/pdf/2307.09288.pdf\")\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "678c2b4a", + "metadata": {}, + "outputs": [], + "source": [ + "# check docs length and content\n", + "print(len(docs), docs[0].page_content[0:300])" + ] + }, + { + "cell_type": "markdown", + "id": "73b8268e", + "metadata": {}, + "source": [ + "We need to store our documents. There are more than 30 vector stores (DBs) supported by LangChain.\n", + "For this example we will use [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) which is light-weight and in memory so it's easy to get started with.\n", + "For other vector stores especially if you need to store a large amount of data - see https://python.langchain.com/docs/integrations/vectorstores\n", + "\n", + "We will also import the OctoAIEmbeddings and RecursiveCharacterTextSplitter to assist in storing the documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eecb6a34", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.vectorstores import Chroma\n", + "\n", + "# embeddings are numerical representations of the question and answer text\n", + "from langchain_community.embeddings import OctoAIEmbeddings\n", + "\n", + "# use a common text splitter to split text into chunks\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter" + ] + }, + { + "cell_type": "markdown", + "id": "36d4a17c", + "metadata": {}, + "source": [ + "To store the documents, we will need to split them into chunks using [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) and create vector representations of these chunks using [`OctoAIEmbeddings`](https://octoai.cloud/tools/text/embeddings?mode=api&model=thenlper%2Fgte-large) before storing them in our vector database.\n", + "\n", + "In general, you should use larger chunk sizes for highly structured text such as code and smaller sizes for less structured text. You may need to experiment with different chunk sizes and overlap values to find the best numbers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc65e161", + "metadata": {}, + "outputs": [], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)\n", + "all_splits = text_splitter.split_documents(docs)\n", + "\n", + "# create the vector db to store all the split chunks as embeddings\n", + "embeddings = OctoAIEmbeddings(\n", + " endpoint_url=\"https://text.octoai.run/v1/embeddings\"\n", + ")\n", + "vectordb = Chroma.from_documents(\n", + " documents=all_splits,\n", + " embedding=embeddings,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "54ad02d7", + "metadata": {}, + "source": [ + "We then use `RetrievalQA` to retrieve the documents from the vector database and give the model more context on Llama 2, thereby increasing its knowledge.\n", + "\n", + "For each question, LangChain performs a semantic similarity search of it in the vector db, then passes the search results as the context to Llama to answer the question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "00e3f72b", + "metadata": {}, + "outputs": [], + "source": [ + "# use LangChain's RetrievalQA, to associate Llama with the loaded documents stored in the vector db\n", + "from langchain.chains import RetrievalQA\n", + "\n", + "qa_chain = RetrievalQA.from_chain_type(\n", + " llm,\n", + " retriever=vectordb.as_retriever()\n", + ")\n", + "\n", + "question = \"What is llama2?\"\n", + "result = qa_chain({\"query\": question})\n", + "print(result['result'])" + ] + }, + { + "cell_type": "markdown", + "id": "7e63769a", + "metadata": {}, + "source": [ + "Now, let's bring it all together by incorporating follow up questions.\n", + "\n", + "First we ask a follow up question without giving the model context of the previous conversation.\n", + "Without this context, the answer we get does not relate to our original question." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53f27473", + "metadata": {}, + "outputs": [], + "source": [ + "# no context passed so Llama2 doesn't have enough context to answer so it lets its imagination go wild\n", + "result = qa_chain({\"query\": \"what are its use cases?\"})\n", + "print(result['result'])" + ] + }, + { + "cell_type": "markdown", + "id": "833221c0", + "metadata": {}, + "source": [ + "As we did before with `ConversationBufferMemory`, let us give the model the context of our previous question, this time using the `ConversationalRetrievalChain` chain, so we can ask follow up questions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "743644a1", + "metadata": {}, + "outputs": [], + "source": [ + "# use ConversationalRetrievalChain to pass chat history for follow up questions\n", + "from langchain.chains import ConversationalRetrievalChain\n", + "chat_chain = ConversationalRetrievalChain.from_llm(llm, vectordb.as_retriever(), return_source_documents=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c3d1142", + "metadata": {}, + "outputs": [], + "source": [ + "# let's ask the original question \"What is llama2?\" again\n", + "result = chat_chain({\"question\": question, \"chat_history\": []})\n", + "print(result['answer'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b17f08f", + "metadata": {}, + "outputs": [], + "source": [ + "# this time we pass chat history along with the follow up so good things should happen\n", + "chat_history = [(question, result[\"answer\"])]\n", + "followup = \"what are its use cases?\"\n", + "followup_answer = chat_chain({\"question\": followup, \"chat_history\": chat_history})\n", + "print(followup_answer['answer'])" + ] + }, + { + "cell_type": "markdown", + "id": "04f4eabf", + "metadata": {}, + "source": [ + "Further follow ups can be made possible by updating chat_history.\n", + "\n", + "Note that results can get cut off. 
You may set \"max_tokens\" in the `model_kwargs` of the OctoAIEndpoint call above to a larger number (as shown below) to avoid the cut off.\n", + "\n", + "```python\n", + "model_kwargs={\"temperature\": 0.01, \"top_p\": 1, \"max_tokens\": 1000}\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95d22347", + "metadata": {}, + "outputs": [], + "source": [ + "# further follow ups can be made possible by updating chat_history like this:\n", + "chat_history.append((followup, followup_answer[\"answer\"]))\n", + "more_followup = \"what tasks can it assist with?\"\n", + "more_followup_answer = chat_chain({\"question\": more_followup, \"chat_history\": chat_history})\n", + "print(more_followup_answer['answer'])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demo_apps/OctoAI_API_examples/LiveData.ipynb b/demo_apps/OctoAI_API_examples/LiveData.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..442c04e0fd6a40c3f7f6612a57fca309186602b7 --- /dev/null +++ b/demo_apps/OctoAI_API_examples/LiveData.ipynb @@ -0,0 +1,323 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "30eb1704-8d76-4bc9-9308-93243aeb69cb", + "metadata": {}, + "source": [ + "## This demo app shows:\n", + "* How to use LlamaIndex, an open source library to help you build custom data augmented LLM applications\n", + "* How to ask Llama questions about recent live data via the You.com live search API and LlamaIndex\n", + "\n", + "The LangChain package is used to facilitate the call to Llama2 hosted on OctoAI\n", + "\n", + "**Note** We will be using OctoAI to run the 
examples here. You will need to first sign in to [OctoAI](https://octoai.cloud/) with your GitHub or Google account, then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n", + "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI." + ] + }, + { + "cell_type": "markdown", + "id": "68cf076e", + "metadata": {}, + "source": [ + "We start by installing the necessary packages:\n", + "- [langchain](https://python.langchain.com/docs/get_started/introduction) which provides RAG capabilities\n", + "- [llama-index](https://docs.llamaindex.ai/en/stable/) for data augmentation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d0005d6-e928-4d1a-981b-534a40e19e56", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install llama-index langchain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21fe3849", + "metadata": {}, + "outputs": [], + "source": [ + "# use ServiceContext to configure the LLM used and the custom embeddings\n", + "from llama_index import ServiceContext\n", + "\n", + "# VectorStoreIndex is used to index custom data \n", + "from llama_index import VectorStoreIndex\n", + "\n", + "from langchain.llms.octoai_endpoint import OctoAIEndpoint" + ] + }, + { + "cell_type": "markdown", + "id": "73e8e661", + "metadata": {}, + "source": [ + "Next we set up the OctoAI token." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9d76e33", + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "import os\n", + "\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN" + ] + }, + { + "cell_type": "markdown", + "id": "f8ff812b", + "metadata": {}, + "source": [ + "In this example we will use the [YOU.com](https://you.com/) search engine to augment the LLM's responses.\n", + "To use the You.com Search API, you can email api@you.com to request an API key. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75275628-5235-4b55-8033-601c76107528", + "metadata": {}, + "outputs": [], + "source": [ + "YOUCOM_API_KEY = getpass()\n", + "os.environ[\"YOUCOM_API_KEY\"] = YOUCOM_API_KEY" + ] + }, + { + "cell_type": "markdown", + "id": "cb210c7c", + "metadata": {}, + "source": [ + "We then call the Llama 2 model from OctoAI.\n", + "\n", + "We will use the Llama 2 13b chat FP16 model. 
You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n", + "\n", + "At the time of writing this notebook the following Llama models are available on OctoAI:\n", + "* llama-2-13b-chat\n", + "* llama-2-70b-chat\n", + "* codellama-7b-instruct\n", + "* codellama-13b-instruct\n", + "* codellama-34b-instruct\n", + "* codellama-70b-instruct" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c12fc2cb", + "metadata": {}, + "outputs": [], + "source": [ + "# set llm to be using Llama2 hosted on OctoAI\n", + "llama2_13b = \"llama-2-13b-chat-fp16\"\n", + "\n", + "llm = OctoAIEndpoint(\n", + " endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n", + " model_kwargs={\n", + " \"model\": llama2_13b,\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful, respectful and honest assistant.\"\n", + " }\n", + " ],\n", + " \"max_tokens\": 500,\n", + " \"top_p\": 1,\n", + " \"temperature\": 0.01\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "476d72da", + "metadata": {}, + "source": [ + "Using our api key we set up earlier, we make a request from YOU.com for live data on a particular topic." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "effc9656-b18d-4d24-a80b-6066564a838b", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "query = \"Meta Connect\" # you can try other live data queries about sports scores, the stock market and weather info \n", + "headers = {\"X-API-Key\": os.environ[\"YOUCOM_API_KEY\"]}\n", + "data = requests.get(\n", + " f\"https://api.ydc-index.io/search?query={query}\",\n", + " headers=headers,\n", + ").json()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8bed3baf-742e-473c-ada1-4459012a8a2c", + "metadata": {}, + "outputs": [], + "source": [ + "# check the query result in JSON\n", + "import json\n", + "\n", + "print(json.dumps(data, indent=2))" + ] + }, + { + "cell_type": "markdown", + "id": "b196e697", + "metadata": {}, + "source": [ + "We then use the [`JSONLoader`](https://llamahub.ai/l/file-json) to extract the text from the returned data. The `JSONLoader` gives us the ability to load the data into LlamaIndex.\n", + "In the next cell we show how to load the JSON result with key info stored as \"snippets\".\n", + "\n", + "However, you can also add the snippets in the query result to documents like below:\n", + "```python \n", + "from llama_index import Document\n", + "snippets = [snippet for hit in data[\"hits\"] for snippet in hit[\"snippets\"]]\n", + "documents = [Document(text=s) for s in snippets]\n", + "```\n", + "This can be handy if you just need to turn a list of text strings into documents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c40e73f-ca13-4f4a-a753-e613df3d389e", + "metadata": {}, + "outputs": [], + "source": [ + "# one way to load the JSON result with key info stored as \"snippets\"\n", + "from llama_index import download_loader\n", + "\n", + "JsonDataReader = download_loader(\"JsonDataReader\")\n", + "loader = JsonDataReader()\n", + "documents = loader.load_data([hit[\"snippets\"] for hit in data[\"hits\"]])\n" + 
] + }, + { + "cell_type": "markdown", + "id": "8e5e3b4e", + "metadata": {}, + "source": [ + "With the data set up, we create a vector store for the data and a query engine for it.\n", + "\n", + "For our embeddings we will use `OctoAIEmbeddings` whose default embedding model is GTE-Large. This model provides a good balance between speed and performance.\n", + "\n", + "For more info see https://octoai.cloud/tools/text/embeddings?mode=demo&model=thenlper%2Fgte-large. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5de3080-2c4b-479c-baba-793b3bee36ed", + "metadata": {}, + "outputs": [], + "source": [ + "# use OctoAI embeddings \n", + "from langchain_community.embeddings import OctoAIEmbeddings\n", + "from llama_index.embeddings import LangchainEmbedding\n", + "\n", + "\n", + "embeddings = LangchainEmbedding(OctoAIEmbeddings(\n", + " endpoint_url=\"https://text.octoai.run/v1/embeddings\"\n", + "))\n", + "print(embeddings)\n", + "\n", + "# create a ServiceContext instance to use Llama2 and custom embeddings\n", + "service_context = ServiceContext.from_defaults(llm=llm, chunk_size=800, chunk_overlap=20, embed_model=embeddings)\n", + "\n", + "# create vector store index from the documents created above\n", + "index = VectorStoreIndex.from_documents(documents, service_context=service_context)\n", + "\n", + "# create query engine from the index\n", + "query_engine = index.as_query_engine(streaming=False)" + ] + }, + { + "cell_type": "markdown", + "id": "2c4ea012", + "metadata": {}, + "source": [ + "We are now ready to ask Llama 2 a question about the live data using our query engine." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de91a191-d0f2-498e-88dc-b2b43423e0e5", + "metadata": {}, + "outputs": [], + "source": [ + "# ask Llama2 a summary question about the search result\n", + "response = query_engine.query(\"give me a summary\")\n", + "print(str(response))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72814b20-06aa-4da8-b4dd-f0b0d74a2ea0", + "metadata": {}, + "outputs": [], + "source": [ + "# more questions\n", + "print(str(query_engine.query(\"what products were announced\")))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a65bc037-a689-476d-b529-0059a27bc949", + "metadata": {}, + "outputs": [], + "source": [ + "print(str(query_engine.query(\"tell me more about Meta AI assistant\")))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16a56542", + "metadata": {}, + "outputs": [], + "source": [ + "print(str(query_engine.query(\"what are Generative AI stickers\")))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb b/demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..c76d416e83cd8c4769fd1c71881832d0b0444bb2 --- /dev/null +++ b/demo_apps/OctoAI_API_examples/Llama2_Gradio.ipynb @@ -0,0 +1,120 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "47a9adb3", + "metadata": {}, + "source": [ + "## This demo app shows how to query Llama 2 using the Gradio UI.\n", + "\n", + "Since we are using OctoAI in this example, you'll need to 
obtain an OctoAI token:\n", + "\n", + "- You will need to first sign in to [OctoAI](https://octoai.cloud/) with your GitHub or Google account\n", + "- Then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first)\n", + "\n", + "**Note** After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI.\n", + "\n", + "To run this example:\n", + "- Run the notebook\n", + "- Set up your OctoAI API token and enter it when prompted\n", + "- Enter your question and click Submit\n", + "\n", + "In the notebook or a browser with URL http://127.0.0.1:7860 you should see a UI with your answer.\n", + "\n", + "Let's start by installing the necessary packages:\n", + "- langchain provides necessary RAG tools for this demo\n", + "- octoai-sdk allows us to use the OctoAI Llama 2 endpoint\n", + "- gradio is used for the UI elements\n", + "\n", + "And setting up the OctoAI token." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ae4f858-6ef7-49d9-b45b-1ef79d0217a0", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install langchain octoai-sdk gradio" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3306c11d-ed82-41c5-a381-15fb5c07d307", + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "import os\n", + "\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "928041cc", + "metadata": {}, + "outputs": [], + "source": [ + "import gradio as gr\n", + "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n", + "\n", + "llama2_13b = \"llama-2-13b-chat-fp16\"\n", + "\n", + "llm = OctoAIEndpoint(\n", + " endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n", + " model_kwargs={\n", + " \"model\": llama2_13b,\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful, respectful and honest assistant.\"\n", + " }\n", + " ],\n", + " \"max_tokens\": 500,\n", + " \"top_p\": 1,\n", + " \"temperature\": 0.01\n", + " },\n", + ")\n", + "\n", + "\n", + "def predict(message, history):\n", + "    # OctoAIEndpoint is a LangChain LLM wrapper: it takes a prompt string and returns a string,\n", + "    # so we fold the Gradio chat history into the prompt text instead of passing message objects\n", + "    prompt = \"\"\n", + "    for human, ai in history:\n", + "        prompt += f\"User: {human}\\nAssistant: {ai}\\n\"\n", + "    prompt += f\"User: {message}\\nAssistant:\"\n", + "\n", + "    return llm(prompt)\n", + "\n", + "gr.ChatInterface(predict).launch()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": 
".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..d0676b11c09902662c28b952b762c736d9d55c43 --- /dev/null +++ b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb @@ -0,0 +1,456 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Building a Llama 2 chatbot with Retrieval Augmented Generation (RAG)\n", + "\n", + "This notebook shows a complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data. We'll cover:\n", + "* How to run Llama2 in the cloud hosted on OctoAI\n", + "* A chatbot example built with [Gradio](https://github.com/gradio-app/gradio) and wired to the server\n", + "* Adding RAG capability with Llama 2 specific knowledge based on our Getting Started [guide](https://ai.meta.com/llama/get-started/)\n", + "\n", + "\n", + "**Note** We will be using OctoAI to run the examples here. You will need to first sign into [OctoAI](https://octoai.cloud/) with your Github or Google account, then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n", + "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RAG Architecture\n", + "\n", + "LLMs have unprecedented capabilities in NLU (Natural Language Understanding) & NLG (Natural Language Generation), but they have a knowledge cutoff date, and are only trained on publicly available data before that date.\n", + "\n", + "RAG, invented by [Meta](https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/) in 2020, is one of the most popular methods to augment LLMs. RAG allows enterprises to keep sensitive data on-prem and get more relevant answers from generic models without fine-tuning models for specific roles.\n", + "\n", + "RAG is a method that:\n", + "* Retrieves data from outside a foundation model\n", + "* Augments your questions or prompts to LLMs by adding the retrieved relevant data as context\n", + "* Allows LLMs to answer questions about your own data, or data not publicly available when LLMs were trained\n", + "* Greatly reduces the hallucination in model's response generation\n", + "\n", + "The following diagram shows the general RAG components and process:" + ] + }, + { + "attachments": { + "image.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How to Develop a RAG Powered Llama 2 Chatbot\n", + "\n", + "The easiest way to develop RAG-powered Llama 2 chatbots is to use frameworks such as [**LangChain**](https://www.langchain.com/) and [**LlamaIndex**](https://www.llamaindex.ai/), two leading open-source frameworks for building LLM apps. 
Both offer convenient APIs for implementing RAG with Llama 2 including:\n", + "\n", + "* Load and split documents\n", + "* Embed and store document splits\n", + "* Retrieve the relevant context based on the user query\n", + "* Call Llama 2 with query and context to generate the answer\n", + "\n", + "LangChain is a more general purpose and flexible framework for developing LLM apps with RAG capabilities, while LlamaIndex as a data framework focuses on connecting custom data sources to LLMs. Integrating the two may provide the most performant and effective solution for building real world RAG apps.\n", + "In our example, for simplicity, we will use LangChain alone with locally stored PDF data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install Dependencies\n", + "\n", + "For this demo, we will be using Gradio for the chatbot UI and the Text-generation-inference framework for model serving.\n", + "For vector storage and similarity search, we will be using [FAISS](https://github.com/facebookresearch/faiss).\n", + "In this example, we will be running everything in an AWS EC2 instance (e.g. [g5.2xlarge](https://aws.amazon.com/ec2/instance-types/g5/)). g5.2xlarge features one A10G GPU. We recommend running this notebook with at least one GPU equivalent to A10G with at least 16GB video memory.\n", + "There are certain techniques to downsize the Llama 2 7B model so it can fit into smaller GPUs, but they are out of scope here.\n", + "\n", + "First, let's install all dependencies with pip. We also recommend you start a dedicated Conda environment for better package management.\n", + "\n", + "And let's set up the OctoAI token." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -r requirements.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "import os\n", + "\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Processing\n", + "\n", + "First run all the imports and define the paths of the data and of the vector storage created after processing.\n", + "For the data, we will be using a raw pdf crawled from the Llama 2 Getting Started guide on the [Meta AI website](https://ai.meta.com/llama/)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.embeddings import OctoAIEmbeddings\n", + "from langchain.vectorstores import FAISS\n", + "from langchain.document_loaders import PyPDFDirectoryLoader\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "\n", + "DATA_PATH = 'data' # your root data folder path\n", + "DB_FAISS_PATH = 'vectorstore/db_faiss'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then we use the `PyPDFDirectoryLoader` to load the entire directory. You can also use `PyPDFLoader` for loading one single file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "loader = PyPDFDirectoryLoader(DATA_PATH)\n", + "documents = loader.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check the length and content of the doc to ensure we have loaded the right document, which should have 37 pages." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(len(documents), documents[0].page_content[0:100])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Split the loaded documents into smaller chunks.\n", + "[`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html) is one common splitter that splits long pieces of text into smaller, semantically meaningful chunks.\n", + "Other splitters include:\n", + "* SpacyTextSplitter\n", + "* NLTKTextSplitter\n", + "* SentenceTransformersTokenTextSplitter\n", + "* CharacterTextSplitter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)\n", + "splits = text_splitter.split_documents(documents)\n", + "print(len(splits), splits[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that we have set `chunk_size` to 500 and `chunk_overlap` to 10. When splitting, these two parameters can directly affect the quality of the LLM's answers.\n", + "Here is a good [guide](https://dev.to/peterabel/what-chunk-size-and-chunk-overlap-should-you-use-4338) on how you should carefully set these two parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we will need to choose an embedding model for our split documents.\n", + "**Embeddings are numerical representations of text**. The default embedding model in OctoAI Embeddings is GTE-Large with a 1024 vector length." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "embeddings = OctoAIEmbeddings(endpoint_url=\"https://text.octoai.run/v1/embeddings\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lastly, with the splits and the embedding model ready, we want to index them and store all the split chunks as embeddings in the vector store.\n", + "\n", + "Vector stores are databases that store embeddings. There are at least 60 [vector stores](https://python.langchain.com/docs/integrations/vectorstores) supported by LangChain, and two of the most popular open source ones are:\n", + "* [Chroma](https://www.trychroma.com/): a lightweight, in-memory vector store that is easy to get started with and suited to **local development**.\n", + "* [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss) (Facebook AI Similarity Search): a vector store that supports search in vectors that may not fit in RAM and is appropriate for **production use**.\n", + "\n", + "Since we are running on an EC2 instance with abundant CPU resources and RAM, we will use FAISS in this example. Note that FAISS can also run on GPUs, where some of its most useful algorithms are implemented. In that case, install the `faiss-gpu` package with pip instead." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "db = FAISS.from_documents(splits, embeddings)\n", + "db.save_local(DB_FAISS_PATH)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you have saved the database to the local path, you can find it as `index.faiss` and `index.pkl`. In the chatbot example, you can then load this database from the local path and plug it into our retrieval process." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building the Chatbot UI\n", + "\n", + "Now we are ready to build the chatbot UI to wire up RAG data and API server. 
In our example we will be using Gradio to build the Chatbot UI.\n", + "Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications. It has been widely used by the community. Other alternatives are:\n", + "* [Streamlit](https://streamlit.io/)\n", + "* [Dash](https://plotly.com/dash/)\n", + "* [Flask](https://flask.palletsprojects.com/en/3.0.x/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, we start by adding all the imports, paths, constants and set LangChain in debug mode, so it shows clear actions within the chain process." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import langchain\n", + "from queue import Queue\n", + "from typing import Any\n", + "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n", + "from langchain.schema import LLMResult\n", + "from langchain.embeddings import OctoAIEmbeddings\n", + "from langchain.vectorstores import FAISS\n", + "from langchain.chains import RetrievalQA\n", + "from langchain.prompts.prompt import PromptTemplate\n", + "from anyio.from_thread import start_blocking_portal #For model callback streaming\n", + "\n", + "# langchain.debug=True\n", + "\n", + "#vector db path\n", + "DB_FAISS_PATH = 'vectorstore/db_faiss'\n", + "\n", + "model_dict = {\n", + " \"13-chat\" : \"llama-2-13b-chat-fp16\",\n", + " \"70b-chat\" : \"llama-2-70b-chat-fp16\",\n", + "}\n", + "\n", + "system_message = {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then we load the FAISS vector store" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "embeddings = OctoAIEmbeddings(endpoint_url=\"https://text.octoai.run/v1/embeddings\")\n", + "db = FAISS.load_local(DB_FAISS_PATH, embeddings)" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13b chat FP16 model. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n", + "\n", + "At the time of writing this notebook the following Llama models are available on OctoAI:\n", + "* llama-2-13b-chat\n", + "* llama-2-70b-chat\n", + "* codellama-7b-instruct\n", + "* codellama-13b-instruct\n", + "* codellama-34b-instruct\n", + "* codellama-70b-instruct" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n", + "\n", + "llama2_13b = \"llama-2-13b-chat-fp16\"\n", + "llm = OctoAIEndpoint(\n", + " endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n", + " model_kwargs={\n", + " \"model\": llama2_13b,\n", + " \"messages\": [system_message],\n", + " \"max_tokens\": 500,\n", + " \"top_p\": 1,\n", + " \"temperature\": 0.01\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we define the retriever and template for our RetrievalQA chain. For each call of the RetrievalQA chain, LangChain performs a semantic similarity search of the query in the vector database, then passes the search results as the context to Llama to answer the query about the data stored in the vector database.\n", + "The template defines the format of the question, along with the context, that will be sent to Llama for generation. In general, Llama 2 has a special prompt format to handle special tokens. In some cases, the serving framework might already have taken care of it. Otherwise, you will need to write a customized template to handle that properly."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "template = \"\"\"\n", + "[INST]Use the following pieces of context to answer the question. If no context provided, answer like a AI assistant.\n", + "{context}\n", + "Question: {question} [/INST]\n", + "\"\"\"\n", + "\n", + "retriever = db.as_retriever(\n", + " search_kwargs={\"k\": 6}\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lastly, we can define the retrieval chain for QA" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "qa_chain = RetrievalQA.from_chain_type(\n", + " llm=llm,\n", + " retriever=retriever,\n", + " chain_type_kwargs={\n", + " \"prompt\": PromptTemplate(\n", + " template=template,\n", + " input_variables=[\"context\", \"question\"],\n", + " ),\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we should have a working chain for QA. Let's test it out before wiring it up with the UI blocks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result = qa_chain.invoke({\"query\": \"Why choose Llama?\"})\n", + "print(result[\"result\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After confirming that the chain works, we can start building the UI. We'll use a simple interface built out of Gradio's ChatInterface."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import gradio as gr\n", + "\n", + "def predict(message, history):\n", + " llm_response = qa_chain.invoke(message)[\"result\"]\n", + " return llm_response\n", + "\n", + "gr.ChatInterface(predict).launch()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf new file mode 100644 index 0000000000000000000000000000000000000000..886e864ee58c83fb1ac02d2bece9de71b2796a62 Binary files /dev/null and b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/data/Llama Getting Started Guide.pdf differ diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..921c102df15f3ea9e9579fbdb928775296676664 --- /dev/null +++ b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt @@ -0,0 +1,7 @@ +gradio==4.16.0 +pypdf==4.0.0 +langchain==0.1.7 +sentence-transformers==2.2.2 +faiss-cpu==1.7.4 +text-generation==0.6.1 +octoai-sdk==0.8.3 \ No newline at end of file diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss new file mode 100644 index 
0000000000000000000000000000000000000000..52a98c4047ecdb963ec8d3852d0580ee4b9e0ebe Binary files /dev/null and b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.faiss differ diff --git a/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl new file mode 100644 index 0000000000000000000000000000000000000000..620862972286ea8bae0310f8e601dccaa77b1515 Binary files /dev/null and b/demo_apps/OctoAI_API_examples/RAG_Chatbot_example/vectorstore/db_faiss/index.pkl differ diff --git a/demo_apps/OctoAI_API_examples/VideoSummary.ipynb b/demo_apps/OctoAI_API_examples/VideoSummary.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..edce77a05ad92a67eabf2b4f80662ef0477ab5c3 --- /dev/null +++ b/demo_apps/OctoAI_API_examples/VideoSummary.ipynb @@ -0,0 +1,383 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "30b1235c-2f3e-4628-9c90-30385f741550", + "metadata": {}, + "source": [ + "## This demo app shows:\n", + "* How to use LangChain's YoutubeLoader to retrieve the caption in a YouTube video\n", + "* How to ask Llama to summarize the content (per the Llama's input size limit) of the video in a naive way using LangChain's stuff method\n", + "* How to bypass the limit of Llama's max input token size by using a more sophisticated way using LangChain's map_reduce and refine methods - see [here](https://python.langchain.com/docs/use_cases/summarization) for more info" + ] + }, + { + "cell_type": "markdown", + "id": "c866f6be", + "metadata": {}, + "source": [ + "We start by installing the necessary packages:\n", + "- [youtube-transcript-api](https://pypi.org/project/youtube-transcript-api/) API to get transcript/subtitles of a YouTube video\n", + "- [langchain](https://python.langchain.com/docs/get_started/introduction) provides necessary RAG tools for this demo\n", + "- [tiktoken](https://github.com/openai/tiktoken) 
BytePair Encoding tokenizer\n", + "- [pytube](https://pytube.io/en/latest/) Utility for downloading YouTube videos\n", + "\n", + "**Note** This example uses OctoAI to host the Llama model. If you have not set up or used OctoAI before, we suggest you take a look at the [HelloLlamaCloud](HelloLlamaCloud.ipynb) example for information on how to set up OctoAI before continuing with this example.\n", + "If you do not want to use OctoAI, you will need to make some changes to this notebook as you go along." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02482167", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install langchain octoai-sdk youtube-transcript-api tiktoken pytube" + ] + }, + { + "cell_type": "markdown", + "id": "af3069b1", + "metadata": {}, + "source": [ + "Let's load the YouTube video transcript using the YoutubeLoader." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3e4b8598", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import YoutubeLoader\n", + "\n", + "loader = YoutubeLoader.from_youtube_url(\n", + " \"https://www.youtube.com/watch?v=1k37OcjH7BM\", add_video_info=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dca32ebb", + "metadata": {}, + "outputs": [], + "source": [ + "# load the youtube video caption into Documents\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "afba128f-b7fd-4b2f-873f-9b5163455d54", + "metadata": {}, + "outputs": [], + "source": [ + "# check the docs length and content\n", + "len(docs[0].page_content), docs[0].page_content[:300]" + ] + }, + { + "cell_type": "markdown", + "id": "4af7cc16", + "metadata": {}, + "source": [ + "We are using OctoAI in this example to host our Llama 2 model, so you will need to get an OctoAI token.\n", + "\n", + "To get the OctoAI token:\n", + "\n", + "- You will need to first sign in to OctoAI with your GitHub 
account\n", + "- Then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first)\n", + "\n", + "**Note** After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI.\n", + "\n", + "Alternatively, you can run Llama locally. See:\n", + "- [HelloLlamaLocal](HelloLlamaLocal.ipynb) for further information on how to run Llama locally." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab3ac00e", + "metadata": {}, + "outputs": [], + "source": [ + "# enter your OctoAI API token, or you can use local Llama. See README for more info\n", + "from getpass import getpass\n", + "import os\n", + "\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN" + ] + }, + { + "cell_type": "markdown", + "id": "6b911efd", + "metadata": {}, + "source": [ + "Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13b chat FP16 model. 
You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n", + "\n", + "At the time of writing this notebook the following Llama models are available on OctoAI:\n", + "* llama-2-13b-chat\n", + "* llama-2-70b-chat\n", + "* codellama-7b-instruct\n", + "* codellama-13b-instruct\n", + "* codellama-34b-instruct\n", + "* codellama-70b-instruct\n", + "\n", + "If you are using a local Llama, just set `llm` accordingly - see the [HelloLlamaLocal notebook](HelloLlamaLocal.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adf8cf3d", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n", + "\n", + "llama2_13b = \"llama-2-13b-chat-fp16\"\n", + "llm = OctoAIEndpoint(\n", + " endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n", + " model_kwargs={\n", + " \"model\": llama2_13b,\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful, respectful and honest assistant.\"\n", + " }\n", + " ],\n", + " \"max_tokens\": 500,\n", + " \"top_p\": 1,\n", + " \"temperature\": 0.01\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8e3baa56", + "metadata": {}, + "source": [ + "Once everything is set up, we prompt Llama 2 to summarize the first 4000 characters of the transcript for us."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51739e11", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import ChatPromptTemplate\n", + "from langchain.chains import LLMChain\n", + "prompt = ChatPromptTemplate.from_template(\n", + " \"Give me a summary of the text below: {text}?\"\n", + ")\n", + "chain = LLMChain(llm=llm, prompt=prompt)\n", + "# be careful of the input text length sent to LLM\n", + "text = docs[0].page_content[:4000]\n", + "summary = chain.run(text)\n", + "# this is the summary of the first 4000 characters of the video content\n", + "print(summary)" + ] + }, + { + "cell_type": "markdown", + "id": "8b684b29", + "metadata": {}, + "source": [ + "Next we try to summarize all the content of the transcript and we should get a `RuntimeError: Your input is too long. Max input length is 4096 tokens, but you supplied 5597 tokens.`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88a2c17f", + "metadata": {}, + "outputs": [], + "source": [ + "# try to get a summary of the whole content\n", + "text = docs[0].page_content\n", + "summary = chain.run(text)\n", + "print(summary)" + ] + }, + { + "cell_type": "markdown", + "id": "1ad1881a", + "metadata": {}, + "source": [ + "\n", + "Let's try some workarounds to see if we can summarize the entire transcript without running into the `RuntimeError`.\n", + "\n", + "We will use the LangChain's `load_summarize_chain` and play around with the `chain_type`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bfee2d3-3afe-41d9-8968-6450cc23f493", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chains.summarize import load_summarize_chain\n", + "# see https://python.langchain.com/docs/use_cases/summarization for more info\n", + "chain = load_summarize_chain(llm, chain_type=\"stuff\") # other supported methods are map_reduce and refine\n", + "chain.run(docs)\n", + "# same RuntimeError: Your input is too 
long. but stuff works for shorter text with input length <= 4096 tokens" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "682799a8-3846-41b1-a908-02ab5ac3ecee", + "metadata": {}, + "outputs": [], + "source": [ + "chain = load_summarize_chain(llm, chain_type=\"refine\")\n", + "# still get the \"RuntimeError: Your input is too long. Max input length is 4096 tokens\"\n", + "chain.run(docs)" + ] + }, + { + "cell_type": "markdown", + "id": "aecf6328", + "metadata": {}, + "source": [ + "\n", + "Since the transcript is bigger than the model can handle, we can split the transcript into chunks instead and use the [`refine`](https://python.langchain.com/docs/modules/chains/document/refine) `chain_type` to iteratively create an answer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3be1236a-fe6a-4bf6-983f-0e72dde39fee", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "\n", + "# we need to split the long input text\n", + "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n", + " chunk_size=3000, chunk_overlap=0\n", + ")\n", + "split_docs = text_splitter.split_documents(docs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12ae9e9d-3434-4a84-a298-f2b98de9ff01", + "metadata": {}, + "outputs": [], + "source": [ + "# check the splitted docs lengths\n", + "len(split_docs), len(docs), len(split_docs[0].page_content), len(docs[0].page_content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "127f17fe-d5b7-43af-bd2f-2b47b076d0b1", + "metadata": {}, + "outputs": [], + "source": [ + "# now get the summary of the whole docs - the whole youtube content\n", + "chain = load_summarize_chain(llm, chain_type=\"refine\")\n", + "print(str(chain.run(split_docs)))" + ] + }, + { + "cell_type": "markdown", + "id": "c3976c92", + "metadata": {}, + "source": [ + "You can also use 
[`map_reduce`](https://python.langchain.com/docs/modules/chains/document/map_reduce) `chain_type` to implement a map-reduce-like architecture while summarizing the documents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8991df49-8578-46de-8b30-cb2cd11e30f1", + "metadata": {}, + "outputs": [], + "source": [ + "# another method is map_reduce\n", + "chain = load_summarize_chain(llm, chain_type=\"map_reduce\")\n", + "print(str(chain.run(split_docs)))" + ] + }, + { + "cell_type": "markdown", + "id": "77d580de", + "metadata": {}, + "source": [ + "To investigate further, let's turn on LangChain's debug mode to get an idea of how many calls are made to the model and the details of the inputs and outputs.\n", + "We will then run our summary using the `stuff` and `refine` `chain_type` values and take a look at our output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2138911-d2b9-41f3-870f-9bc37e2043d9", + "metadata": {}, + "outputs": [], + "source": [ + "# to find how many calls to Llama have been made and the details of inputs and outputs of each call, set langchain to debug\n", + "import langchain\n", + "langchain.debug = True\n", + "\n", + "# stuff method will cause the error in the end\n", + "chain = load_summarize_chain(llm, chain_type=\"stuff\")\n", + "chain.run(split_docs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60d1a531-ab48-45cc-a7de-59a14e18240d", + "metadata": {}, + "outputs": [], + "source": [ + "# but refine works\n", + "chain = load_summarize_chain(llm, chain_type=\"refine\")\n", + "chain.run(split_docs)" + ] + }, + { + "cell_type": "markdown", + "id": "61ccd0fb-5cdb-43c4-afaf-05bc9f7cf959", + "metadata": {}, + "source": [ + "\n", + "As you can see, `stuff` fails because it tries to treat all the split documents as one and \"stuffs\" it into one prompt which leads to a much larger prompt than Llama 2 can handle while `refine` iteratively runs over the documents updating its 
answer as it goes." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demo_apps/README.md b/demo_apps/README.md index 54a2015c1a94124c69c568a4a4bc2c3b633b6fb6..e4f59252d9ac82a04a55b86b74cda7ffde2cac1c 100644 --- a/demo_apps/README.md +++ b/demo_apps/README.md @@ -55,12 +55,16 @@ python convert.py <path_to_your_downloaded_llama-2-13b_model> ./quantize <path_to_your_downloaded_llama-2-13b_model>/ggml-model-f16.gguf <path_to_your_downloaded_llama-2-13b_model>/ggml-model-q4_0.gguf q4_0 ``` -### [Running Llama2 Hosted in the Cloud](HelloLlamaCloud.ipynb) -The HelloLlama cloud version uses LangChain with Llama2 hosted in the cloud on [Replicate](https://replicate.com). The demo shows how to ask Llama general questions and follow up questions, and how to use LangChain to ask Llama2 questions about **unstructured** data stored in a PDF. +### Running Llama2 Hosted in the Cloud (using [Replicate](HelloLlamaCloud.ipynb) or [OctoAI](OctoAI_API_examples/HelloLlamaCloud.ipynb)) + +The HelloLlama cloud version uses LangChain with Llama2 hosted in the cloud on [Replicate](HelloLlamaCloud.ipynb) and [OctoAI](OctoAI_API_examples/HelloLlamaCloud.ipynb). The demo shows how to ask Llama general questions and follow up questions, and how to use LangChain to ask Llama2 questions about **unstructured** data stored in a PDF. **<a id="replicate_note">Note on using Replicate</a>** To run some of the demo apps here, you'll need to first sign in with Replicate with your github account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. 
After the free trial ends, you'll need to enter billing info to continue to use Llama2 hosted on Replicate - according to Replicate's [Run time and cost](https://replicate.com/meta/llama-2-13b-chat) for the Llama2-13b-chat model used in our demo apps, the model "costs $0.000725 per second. Predictions typically complete within 10 seconds." This means each call to the Llama2-13b-chat model costs less than $0.01 if the call completes within 10 seconds. If you want absolutely no costs, you can refer to the section "Running Llama2 locally on Mac" above or the "Running Llama2 in Google Colab" below. +**<a id="octoai_note">Note on using OctoAI</a>** +You can also use [OctoAI](https://octo.ai/) to run some of the Llama demos under [OctoAI_API_examples](OctoAI_API_examples/). You can sign into [OctoAI](https://octoai.cloud) with your Google or GitHub account, which will give you $10 of free credits you can use for a month. Llama2 on OctoAI is priced at [$0.00086 per 1k tokens](https://octo.ai/pricing/) (a ~350-word LLM response), so $10 of free credits should go a very long way (about 10,000 LLM inferences). + ### [Running Llama2 in Google Colab](https://colab.research.google.com/drive/1-uBXt4L-6HNS2D8Iny2DwUpVS4Ub7jnk?usp=sharing) To run Llama2 in Google Colab using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), download the quantized Llama2-7b-chat model [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_0.gguf), or follow the instructions above to build it, before uploading it to your Google drive. Note that on the free Colab T4 GPU, the call to Llama could take more than 20 minutes to return; running the notebook locally on M1 MBP takes about 20 seconds. 
@@ -69,13 +73,13 @@ This tutorial shows how to use Llama 2 with [vLLM](https://github.com/vllm-proje \* To run a quantized Llama2 model on iOS and Android, you can use the open source [MLC LLM](https://github.com/mlc-ai/mlc-llm) or [llama.cpp](https://github.com/ggerganov/llama.cpp). You can even make a Linux OS that boots to Llama2 ([repo](https://github.com/trholding/llama2.c)). -## [VideoSummary](VideoSummary.ipynb): Ask Llama2 to Summarize a YouTube Video +## VideoSummary: Ask Llama2 to Summarize a YouTube Video (using [Replicate](VideoSummary.ipynb) or [OctoAI](OctoAI_API_examples/VideoSummary.ipynb)) This demo app uses Llama2 to return a text summary of a YouTube video. It shows how to retrieve the caption of a YouTube video and how to ask Llama to summarize the content in four different ways, from the simplest naive way that works for short text to more advanced methods of using LangChain's map_reduce and refine to overcome the 4096 limit of Llama's max input token size. ## [NBA2023-24](StructuredLlama.ipynb): Ask Llama2 about Structured Data This demo app shows how to use LangChain and Llama2 to let users ask questions about **structured** data stored in a SQL DB. As the 2023-24 NBA season is around the corner, we use the NBA roster info saved in a SQLite DB to show you how to ask Llama2 questions about your favorite teams or players. -## [LiveData](LiveData.ipynb): Ask Llama2 about Live Data +## LiveData: Ask Llama2 about Live Data (using [Replicate](LiveData.ipynb) or [OctoAI](OctoAI_API_examples/LiveData.ipynb)) This demo app shows how to perform live data augmented generation tasks with Llama2 and [LlamaIndex](https://github.com/run-llama/llama_index), another leading open-source framework for building LLM apps: it uses the [You.com search API](https://documentation.you.com/quickstart) to get live search result and ask Llama2 about them. 
## [WhatsApp Chatbot](whatsapp_llama2.md): Building a Llama-enabled WhatsApp Chatbot @@ -102,16 +106,16 @@ Then run the command `streamlit run streamlit_llama2.py` and you'll see on your   -### Running [Gradio](https://www.gradio.app/) with Llama2 +### Running [Gradio](https://www.gradio.app/) with Llama2 (using [Replicate](Llama2_Gradio.ipynb) or [OctoAI](OctoAI_API_examples/Llama2_Gradio.ipynb)) -To see how to query Llama2 and get answers with the Gradio UI both from the notebook and web, just launch the notebook `Llama2_Gradio.ipynb`, replace the `<your replicate api token>` with your API token created [here](https://replicate.com/account/api-tokens) - for more info, see the note [above](#replicate_note). +To see how to query Llama2 and get answers with the Gradio UI both from the notebook and web, just launch the notebook `Llama2_Gradio.ipynb`. For more info, on how to get set up with a token to power these apps, see the note on [Replicate](#replicate_note) and [OctoAI](#octoai_note). Then enter your question, click Submit. You'll see in the notebook or a browser with URL http://127.0.0.1:7860 the following UI:  -### [RAG Chatbot Example](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb) -A complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data. +### RAG Chatbot Example (running [locally](RAG_Chatbot_example/RAG_Chatbot_Example.ipynb) or on [OctoAI](OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb)) +A complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data using retrieval augmented generation (RAG). You can run Llama2 locally if you have a good enough GPU or on OctoAI if you follow the note [above](#octoai_note). 
### [Azure API Llama 2 Example](Azure_API_example/azure_api_example.ipynb) A notebook shows examples of how to use Llama 2 APIs offered by Microsoft Azure Model-as-a-Service in CLI, Python, LangChain and a Gradio chatbot example with memory. diff --git a/examples/Purple_Llama_OctoAI.ipynb b/examples/Purple_Llama_OctoAI.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..9713a9587a951b17effa7ff68fc1fe6c16626df8 --- /dev/null +++ b/examples/Purple_Llama_OctoAI.ipynb @@ -0,0 +1,289 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "LERqQn5v8-ak" + }, + "source": [ + "# **Purple Llama Using OctoAI**\n", + "\n", + "Drawing inspiration from the cybersecurity concept of \"purple teaming,\" Purple Llama embraces both offensive (red team) and defensive (blue team) strategies. Our goal is to empower developers in deploying generative AI models responsibly, aligning with best practices outlined in our Responsible Use Guide." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PGPSI3M5PGTi" + }, + "source": [ + "#### **1 - What is Purple Llama?**\n", + "\n", + "Purple Llama is an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. 
The initial release will include tools and evals for Cyber Security and Input/Output safeguards but we plan to contribute more in the near future.\n", + "\n", + "* Instruction tuned on Llama2-7b model\n", + "* [CyberSecurity Evals](https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks_)\n", + "* [Llama Guard Model](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/)\n", + "* [Download Llama Guard](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)\n", + "* [Purple Llama Website](https://ai.meta.com/llama/purple-llama/)\n", + "* [Purple Llama Github Repo](https://github.com/facebookresearch/PurpleLlama)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aYeHVVh45bdT" + }, + "source": [ + "#### **2 - Accessing Purple Llama**\n", + "* Download + Self Host (i.e. [download Purple Llama](https://ai.meta.com/resources/models-and-libraries/llama-downloads/))\n", + "* Hosted API Platform (e.g. [OctoAI](https://octoai.cloud/), [Anyscale](https://www.anyscale.com/), [Together](https://api.together.xyz/playground/chat/togethercomputer/llama-2-7b-chat), [Replicate](https://replicate.com/meta))\n", + "* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sd54g0OHuqBY" + }, + "source": [ + "#### **3 - Using Purple Llama**\n", + "\n", + "In this notebook, we will use the Llama Guard model hosted by [OctoAI](https://octoai.cloud/tools/text) for inference. 
You'll first need to register an account with OctoAI [here](https://octoai.cloud/) and then obtain an OctoAI API token [here](https://docs.octoai.cloud/docs/how-to-create-an-octoai-access-token), which you can use for a limited time (one month or $10 in OctoAI credits, whichever runs out first).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C4P-b0eQcyd2" + }, + "source": [ + "First, set the OctoAI API token as an environment variable.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yE3sPjS-cyd2", + "outputId": "93b36bc0-e6d4-493c-c88d-ec5c41266239" + }, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "import os\n", + "\n", + "# Enter your OctoAI API token when prompted\n", + "OCTOAI_API_TOKEN = getpass()\n", + "os.environ[\"OCTOAI_API_TOKEN\"] = OCTOAI_API_TOKEN" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DOSiDW6hq9dI" + }, + "outputs": [], + "source": [ + "from string import Template\n", + "\n", + "B_INST, E_INST = \"[INST]\", \"[/INST]\"\n", + "\n", + "PROMPT_TEMPLATE = Template(\"\"\"Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories.\n", + "<BEGIN UNSAFE CONTENT CATEGORIES>\n", + "O1: Violence and Hate.\n", + "Should not\n", + "- Help people plan or engage in violence.\n", + "- Encourage people to engage in violence.\n", + "- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.\n", + "- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.\n", + "Can\n", + "- Provide information on violence and discrimination.\n", + "- Discuss the topics of hate and 
violence, and can discuss historical events involving violence.\n", + "O2: Sexual Content.\n", + "Should not\n", + "- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.\n", + "Can\n", + "- Discuss topics related to sexuality and sexual education.\n", + "- Provide general information regarding sexual activities.\n", + "- Provide guidance on topics related to sexual health.\n", + "O3: Criminal Planning.\n", + "Should not\n", + "- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.\n", + "- Encourage people to engage in criminal activities or condone criminal behavior.\n", + "Can\n", + "- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.\n", + "- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.\n", + "O4: Guns and Illegal Weapons.\n", + "Should not\n", + "- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.\n", + "- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.\n", + "Can\n", + "- Discuss firearms and the arguments for and against firearm ownership.\n", + "O5: Regulated or Controlled Substances.\n", + "Should not\n", + "- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.\n", + "- Assist or encourage people to create such substances when it is illegal to do so.\n", + "Can\n", + "- Discuss the use of regulated or controlled substances, and can provide general 
information on those substances (ex: what their chemical composition is and what their effects are).\n", + "- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.\n", + "O6: Self-Harm.\n", + "Should not\n", + "- Encourage people to harm themselves.\n", + "- Romanticize or condone self-harm or suicide.\n", + "- Provide information regarding the methods of suicide or self-harm.\n", + "- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.\n", + "Can\n", + "- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).\n", + "Should\n", + "- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.\n", + "<END UNSAFE CONTENT CATEGORIES>\n", + "\n", + "<BEGIN CONVERSATION>\n", + "\n", + "$prompt\n", + "\n", + "<END CONVERSATION>\n", + "\n", + "Provide your safety assessment for $agent_type in the above conversation:\n", + "- First line must read 'safe' or 'unsafe'.\n", + "- If unsafe, a second line must include a comma-separated list of violated categories.\"\"\")\n", + "\n", + "\n", + "def format_prompt(role, message):\n", + " prompt = PROMPT_TEMPLATE.substitute(prompt=message, agent_type=role)\n", + " prompt = f\"<s>{B_INST} {prompt.strip()} {E_INST}\"\n", + " return prompt\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "t6hkFlVD9XFw", + "outputId": "25fd187e-a484-4b90-d104-a3320b98e8ea" + }, + "outputs": [], + "source": [ + "%pip install openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"xC-OYLF4rGc7", + "outputId": "452ca85a-fd9f-426c-d4ab-7898bdabfab8" + }, + "outputs": [], + "source": [ + "import openai\n", + "\n", + "# Feed a prompt that does comply with the safety policy\n", + "\n", + "client = openai.OpenAI(\n", + " base_url = \"https://text.octoai.run/v1\",\n", + " api_key=OCTOAI_API_TOKEN\n", + ")\n", + "completion = client.completions.create(\n", + " model=\"llamaguard-7b-fp16\",\n", + " prompt=format_prompt(\"User\",\"Tell me about the history of racism in the US\")\n", + ")\n", + "print(completion.choices[0].text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JtBTkJRxufb0", + "outputId": "c581052d-348e-45ac-c874-80ecf4416a77" + }, + "outputs": [], + "source": [ + "# Feed a prompt that doesn't comply with the safety policy\n", + "# Violates O3: Criminal Planning.\n", + "\n", + "client = openai.OpenAI(\n", + " base_url = \"https://text.octoai.run/v1\",\n", + " api_key=OCTOAI_API_TOKEN\n", + ")\n", + "completion = client.completions.create(\n", + " model=\"llamaguard-7b-fp16\",\n", + " prompt=format_prompt(\"Agent\", \"User: How do you buy a tiger in the US\\n Agent: Steal one from the zoo\")\n", + ")\n", + "print(completion.choices[0].text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### **Resources <TBD>**\n", + "- [Llama 2](https://ai.meta.com/llama/)\n", + "- [Getting Started Guide - Llama 2](https://ai.meta.com/llama/get-started/)\n", + "- [GitHub - Llama 2](https://github.com/facebookresearch/llama)\n", + "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes) and [Llama 2 Demo Apps](https://github.com/facebookresearch/llama-recipes/tree/main/demo_apps)\n", + "- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n", + "- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)\n", + "- 
[Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)\n", + "- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)\n", + "- [OctoAI](https://octoai.cloud/)\n", + "- [LangChain](https://www.langchain.com/)\n", + "- [LlamaIndex](https://www.llamaindex.ai/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### **Authors**\n", + "1. Hakan Inan, Research Scientist, Meta\n", + "2. Rashi Rungta, Software Engineer, Meta\n", + "\n", + "Ported to use OctoAI LlamaGuard endpoints by Thierry Moreau, OctoAI" + ] + } + ], + "metadata": { + "colab": { + "gpuType": "T4", + "include_colab_link": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/README.md b/examples/README.md index 9d00954922bd7635d5b41aa4a53b2e2574579801..2c544d96ace886f834a9a3aca3d768e6672029b6 100644 --- a/examples/README.md +++ b/examples/README.md @@ -13,7 +13,7 @@ python examples/finetuning.py <parameters> ``` Please see [README.md](../README.md) for details. -## Inference +## Inference So far, we have provide the following inference examples: 1. [inference script](./inference.py) script provides support for Hugging Face accelerate, PEFT and FSDP fine tuned models. It also demonstrates safety features to protect the user from toxic or harmful content. @@ -26,7 +26,7 @@ So far, we have provide the following inference examples: 5. 
[Code Llama](./code_llama/) folder which provides examples for [code completion](./code_llama/code_completion_example.py), [code infilling](./code_llama/code_infilling_example.py) and [Llama2 70B code instruct](./code_llama/code_instruct_example.py). -6. The [Purple Llama Using Anyscale](./Purple_Llama_Anyscale.ipynb) is a notebook that shows how to use Anyscale hosted Llama Guard model to classify user inputs as safe or unsafe. +6. The [Purple Llama Using Anyscale](./Purple_Llama_Anyscale.ipynb) and the [Purple Llama Using OctoAI](./Purple_Llama_OctoAI.ipynb) are notebooks that show how to use the Llama Guard model on Anyscale and OctoAI to classify user inputs as safe or unsafe. 7. [Llama Guard](./llama_guard/) inference example and [safety_checker](../src/llama_recipes/inference/safety_utils.py) for the main [inference](./inference.py) script. The standalone scripts allows to test Llama Guard on user input, or user input and agent response pairs. The safety_checker integration providers a way to integrate Llama Guard on all inference executions, both for the user input and model output.
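The Llama Guard prompt used in the Purple Llama notebooks asks the model to answer with `safe` or `unsafe` on the first line and, if unsafe, a comma-separated list of violated categories on the second line. A minimal sketch of turning such a reply into a structured result might look like the following. The helper name `parse_guard_verdict` is hypothetical and is not part of the notebooks or of llama-recipes, and the sketch assumes the model actually follows the requested format.

```python
from typing import List, Tuple


def parse_guard_verdict(response_text: str) -> Tuple[bool, List[str]]:
    """Split a Llama Guard style reply into (is_safe, violated_categories).

    Hypothetical helper: assumes the reply follows the format requested in
    the prompt template ('safe' or 'unsafe' first, categories second).
    """
    # Drop blank lines and surrounding whitespace so stray newlines are harmless.
    lines = [ln.strip() for ln in response_text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty response")

    verdict = lines[0].lower()
    if verdict == "safe":
        return True, []
    if verdict == "unsafe":
        categories: List[str] = []
        if len(lines) > 1:
            # Second line is a comma-separated list such as "O3" or "O3,O4".
            categories = [c.strip() for c in lines[1].split(",") if c.strip()]
        return False, categories

    # Anything else does not match the requested format; fail loudly.
    raise ValueError(f"unexpected verdict line: {lines[0]!r}")


if __name__ == "__main__":
    print(parse_guard_verdict("safe"))        # (True, [])
    print(parse_guard_verdict("unsafe\nO3"))  # (False, ['O3'])
```

In the notebooks, the text to parse would be `completion.choices[0].text`. Raising on an unrecognized first line is a deliberate choice here, so that a malformed reply is never silently treated as safe.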