diff --git a/docs/community/integrations/uptrain.md b/docs/community/integrations/uptrain.md index d07e25fcbf1b56789fb8861587b136bdb7c1b5ae..c91ed8ba48d5bd8c0dc027f6d0dd90d791384e9f 100644 --- a/docs/community/integrations/uptrain.md +++ b/docs/community/integrations/uptrain.md @@ -1,89 +1,121 @@ # Perform Evaluations on LlamaIndex with UpTrain -**Overview**: In this example, we will see how to use UpTrain with LlamaIndex. +**Overview**: In this example, we will see how to use UpTrain with LlamaIndex. UpTrain ([github](https://github.com/uptrain-ai/uptrain) || [website](https://github.com/uptrain-ai/uptrain/) || [docs](https://docs.uptrain.ai/)) is an open-source platform to evaluate and improve GenAI applications. It provides grades for 20+ preconfigured checks (covering language, code, embedding use cases), performs root cause analysis on failure cases and gives insights on how to resolve them. More details on UpTrain's evaluations can be found [here](https://github.com/uptrain-ai/uptrain?tab=readme-ov-file#pre-built-evaluations-we-offer-). -**Problem**: There are two main problems: +**Problem**: As an increasing number of companies are graduating their LLM prototypes to production-ready applications, their RAG pipelines are also getting complex. Developers are utilising modules like QueryRewrite, Context ReRank, etc., to enhance the accuracy of their RAG systems. -1. The data that most Large Language Models are trained on is not representative of the data that they are used on. This leads to a mismatch between the training and test distributions, which can lead to poor performance. -2. The results generated by Large Language Models are not always reliable. The responses might not be relevant to the prompt, not align with the desired tone or the context, or might be offensive etc. +With increasing complexity comes more points of failure. -**Solution**: The above two problems are solved by two different tools and we will show you how to use them together: +1. Advanced Evals are needed to evaluate the quality of these newer modules and determine if they actually improve the system's accuracy. +2. A robust experimentation framework is needed to systematically test different modules and make data-driven decisions. -1. LlamaIndex solves the first problem by allowing you to perform Retrieval Augmented Generation (RAG) with a retriever that is fine-tuned on your own data. This allows you to use your own data to fine-tune a retriever, and then use that retriever to perform RAG. -2. UpTrain solves the second problem by allowing you to perform evaluations on the generated responses. This helps you to ensure that the responses are relevant to the prompt, align with the desired tone or the context, and are not offensive etc. +**Solution**: UpTrain helps to solve for both: + +1. UpTrain provides a series of checks to evaluate the quality of generated response, retrieved-context as well as all the interim steps. The relevant checks are ContextRelevance, SubQueryCompleteness, ContextReranking, ContextConciseness, FactualAccuracy, ContextUtilization, ResponseCompleteness, ResponseConciseness, etc. +2. UpTrain also allows you to experiment with different embedding models as well as have an "evaluate_experiments" method to compare different RAG configurations. # How to go about it? -There two ways you can use UpTrain with LlamaIndex: +There are two ways you can use UpTrain with LlamaIndex: -1. **Using the UpTrain Callback Handler**: This method allows you to seamlessly integrate UpTrain with LlamaIndex. 
You can simply add UpTrainCallbackHandler to your existing LlamaIndex pipeline and it will take care of sending the generated responses to the UpTrain Managed Service for evaluations. This is the recommended method as it is the easiest to use and provides you with dashboards and insights with minimal effort. +1. **Using the UpTrain Callback Handler**: This method allows you to seamlessly integrate UpTrain with LlamaIndex. You can simply add UpTrainCallbackHandler to your existing LlamaIndex pipeline and it will evaluate all components of your RAG pipeline. This is the recommended method as it is the easiest to use and provides you with dashboards and insights with minimal effort. 2. **Using UpTrain's EvalLlamaIndex**: This method allows you to use UpTrain to perform evaluations on the generated responses. You can use the EvalLlamaIndex object to generate responses for the queries and then perform evaluations on the responses. You can find a detailed tutorial on how to do this below. This method offers more flexibility and control over the evaluations, but requires more effort to set up and use. # 1. Using the UpTrain Callback Handler <a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/callbacks/UpTrainCallback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> -Three additional evaluations for Llamaindex have been introduced, complementing existing ones. These evaluations run automatically, with results displayed in the output. More details on UpTrain's evaluations can be found [here](https://github.com/uptrain-ai/uptrain?tab=readme-ov-file#pre-built-evaluations-we-offer-). +Below is how to use the UpTrain Callback Handler to evaluate different components of your RAG pipelines. + +## 1. **RAG Query Engine Evaluations**: + +The RAG query engine plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we conduct the following evaluations: + +- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not. +- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context. +- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively. + +## 2. **Sub-Question Query Generation Evaluation**: + +The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using a RAG query engine. To measure its accuracy, we use: + +- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Assures that the sub-questions accurately and comprehensively cover the original query. + +## 3. **Re-Ranking Evaluations**: + +Re-ranking involves reordering nodes based on relevance to the query and choosing the top nodes. Different evaluations are performed based on the number of nodes returned after re-ranking. + +a. Same Number of Nodes + +- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks if the order of re-ranked nodes is more relevant to the query than the original order. 
-Selected operators from the LlamaIndex pipeline are highlighted for demonstration: +b. Different Number of Nodes: + +- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Examines whether the reduced number of nodes still provides all the required information. + +These evaluations collectively ensure the robustness and effectiveness of the RAG query engine, SubQuestionQueryGeneration operator, and the re-ranking process in the LlamaIndex pipeline. + +#### **Note:** + +- We have performed evaluations using a basic RAG query engine; the same evaluations can be performed using the advanced RAG query engine as well. +- The same is true for the re-ranking evaluations: we have performed them using SentenceTransformerRerank, but the same evaluations can be performed using other re-rankers as well. ## 1. **RAG Query Engine Evaluations**: The RAG query engine plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we conduct the following evaluations: -- **Context Relevance**: Determines if the context extracted from the query is relevant to the response. -- **Factual Accuracy**: Assesses if the LLM is hallcuinating or providing incorrect information. -- **Response Completeness**: Checks if the response contains all the information requested by the query. +- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not. +- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context. +- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively. ## 2. **Sub-Question Query Generation Evaluation**: -The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using a RAG query engine. Given the complexity, we include the previous evaluations and add: +The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using a RAG query engine. To measure its accuracy, we use: -- **Sub Query Completeness**: Assures that the sub-questions accurately and comprehensively cover the original query. +- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Assures that the sub-questions accurately and comprehensively cover the original query. ## 3. **Re-Ranking Evaluations**: -Re-ranking involves reordering nodes based on relevance to the query and choosing top n nodes. Different evaluations are performed based on the number of nodes returned after re-ranking. +Re-ranking involves reordering nodes based on relevance to the query and choosing the top nodes. Different evaluations are performed based on the number of nodes returned after re-ranking. a. Same Number of Nodes -- **Context Reranking**: Checks if the order of re-ranked nodes is more relevant to the query than the original order. +- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks if the order of re-ranked nodes is more relevant to the query than the original order. b. 
Different Number of Nodes: -- **Context Conciseness**: Examines whether the reduced number of nodes still provides all the required information. +- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Examines whether the reduced number of nodes still provides all the required information. These evaluations collectively ensure the robustness and effectiveness of the RAG query engine, SubQuestionQueryGeneration operator, and the re-ranking process in the LlamaIndex pipeline. #### **Note:** - We have performed evaluations using basic RAG query engine, the same evaluations can be performed using the advanced RAG query engine as well. -- Same is true for Re-Ranking evaluations, we have performed evaluations using CohereRerank, the same evaluations can be performed using other re-rankers as well. +- Same is true for Re-Ranking evaluations, we have performed evaluations using SentenceTransformerRerank, the same evaluations can be performed using other re-rankers as well. ## Install Dependencies and Import Libraries Install notebook dependencies. ```bash -pip install -q html2text llama-index pandas tqdm uptrain cohere +%pip install llama-index-readers-web +%pip install llama-index-callbacks-uptrain +%pip install -q html2text llama-index pandas tqdm uptrain torch sentence-transformers ``` Import libraries. ```python -from llama_index import ( - ServiceContext, - VectorStoreIndex, -) -from llama_index.node_parser import SentenceSplitter -from llama_index.readers import SimpleWebPageReader -from llama_index.callbacks import CallbackManager, UpTrainCallbackHandler -from llama_index.postprocessor.cohere_rerank import CohereRerank -from llama_index.service_context import set_global_service_context -from llama_index.query_engine.sub_question_query_engine import ( - SubQuestionQueryEngine, -) -from llama_index.tools.query_engine import QueryEngineTool -from llama_index.tools.types import ToolMetadata +from llama_index.core import Settings, VectorStoreIndex +from llama_index.core.node_parser import SentenceSplitter +from llama_index.readers.web import SimpleWebPageReader +from llama_index.core.callbacks import CallbackManager +from llama_index.callbacks.uptrain.base import UpTrainCallbackHandler +from llama_index.core.query_engine import SubQuestionQueryEngine +from llama_index.core.tools import QueryEngineTool, ToolMetadata +from llama_index.core.postprocessor import SentenceTransformerRerank +from llama_index.llms.openai import OpenAI + +import os ``` ## Setup @@ -123,16 +155,17 @@ Parameters: **Note:** The `project_name_prefix` will be used as prefix for the project names in the UpTrain dashboard. These will be different for different types of evals. For example, if you set project_name_prefix="llama" and perform the sub_question evaluation, the project name will be "llama_sub_question_answering". 
```python +os.environ[ + "OPENAI_API_KEY" +] = "sk-***********" # Replace with your OpenAI API key + callback_handler = UpTrainCallbackHandler( key_type="openai", - api_key="sk-******************************", + api_key=os.environ["OPENAI_API_KEY"], project_name_prefix="llama", ) -callback_manager = CallbackManager([callback_handler]) -service_context = ServiceContext.from_defaults( - callback_manager=callback_manager -) -set_global_service_context(service_context) + +Settings.callback_manager = CallbackManager([callback_handler]) ``` ## Load and Parse Documents @@ -158,13 +191,13 @@ nodes = parser.get_nodes_from_documents(documents) UpTrain callback handler will automatically capture the query, context and response once generated and will run the following three evaluations _(Graded from 0 to 1)_ on the response: -- **Context Relevance**: Check if the context extractedfrom the query is relevant to the response. -- **Factual Accuracy**: Check how factually accurate the response is. -- **Response Completeness**: Check if the response contains all the information that the query is asking for. +- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not. +- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context. +- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively. ```python index = VectorStoreIndex.from_documents( - documents, service_context=service_context + documents, ) query_engine = index.as_query_engine() @@ -181,55 +214,66 @@ for query in queries: ``` Question: What did Paul Graham do growing up? + Response: Paul Graham wrote short stories and started programming on the IBM 1401 in 9th grade using an early version of Fortran. Later, he convinced his father to buy a TRS-80, where he wrote simple games, a program to predict rocket heights, and a word processor. + Context Relevance Score: 0.0 Factual Accuracy Score: 1.0 - Response Completeness Score: 0.0 + Response Completeness Score: 1.0 Question: When and how did Paul Graham's mother die? + Response: Paul Graham's mother died when he was 18 years old, from a brain tumor. + Context Relevance Score: 0.0 - Factual Accuracy Score: 1.0 - Response Completeness Score: 0.0 + Factual Accuracy Score: 0.0 + Response Completeness Score: 1.0 Question: What, in Paul Graham's opinion, is the most distinctive thing about YC? - Context Relevance Score: 1.0 - Factual Accuracy Score: 1.0 + Response: The most distinctive thing about Y Combinator, according to Paul Graham, is that instead of deciding for himself what to work on, the problems come to him. Every 6 months, a new batch of startups brings their problems, which then become the focus of YC's work. + + Context Relevance Score: 0.0 + Factual Accuracy Score: 0.5 Response Completeness Score: 1.0 Question: When and how did Paul Graham meet Jessica Livingston? + Response: Paul Graham met Jessica Livingston at a big party at his house in October 2003. + Context Relevance Score: 1.0 - Factual Accuracy Score: 1.0 - Response Completeness Score: 0.5 + Factual Accuracy Score: 0.5 + Response Completeness Score: 1.0 Question: What is Bel, and when and where was it written? 
+ Response: Bel is a new Lisp that was written in Arc. It was developed over a period of 4 years, from March 26, 2015 to October 12, 2019. Most of the work on Bel was done in England, where the author had moved to in the summer of 2016. + Context Relevance Score: 1.0 Factual Accuracy Score: 1.0 - Response Completeness Score: 0.0 + Response Completeness Score: 1.0 Here's an example of the dashboard showing how you can filter and drill down to the failing cases and get insights on the failing cases:  # 2. Sub-Question Query Engine Evaluation -The **sub question query engine** is used to tackle the problem of answering a complex query using multiple data sources. It first breaks down the complex query into sub questions for each relevant data source, then gather all the intermediate responses and synthesizes a final response. +The **sub-question query engine** is used to tackle the problem of answering a complex query using multiple data sources. It first breaks down the complex query into sub-questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response. UpTrain callback handler will automatically capture the sub-question and the responses for each of them once generated and will run the following three evaluations _(Graded from 0 to 1)_ on the response: -- **Context Relevance**: Check if the context extractedfrom the query is relevant to the response. -- **Factual Accuracy**: Check how factually accurate the response is. -- **Response Completeness**: Check if the response contains all the information that the query is asking for. +- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not. +- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context. +- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively. In addition to the above evaluations, the callback handler will also run the following evaluation: -- **Sub Query Completeness**: Checks if the sub-questions accurately and completely cover the original query. +- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Assures that the sub-questions accurately and comprehensively cover the original query. ```python # build index and query engine vector_query_engine = VectorStoreIndex.from_documents( - documents=documents, use_async=True, service_context=service_context + documents=documents, + use_async=True, ).as_query_engine() query_engine_tools = [ @@ -244,7 +288,6 @@ query_engine_tools = [ query_engine = SubQuestionQueryEngine.from_defaults( query_engine_tools=query_engine_tools, - service_context=service_context, use_async=True, ) @@ -253,22 +296,38 @@ response = query_engine.query( ) ``` - Question: What did Paul Graham work on during YC? - Context Relevance Score: 0.5 + Generated 3 sub questions. + [1;3;38;2;237;90;200m[documents] Q: What did Paul Graham work on before Y Combinator? + [0m[1;3;38;2;90;149;237m[documents] Q: What did Paul Graham work on during Y Combinator? + [0m[1;3;38;2;11;159;203m[documents] Q: What did Paul Graham work on after Y Combinator? 
+ [0m[1;3;38;2;11;159;203m[documents] A: Paul Graham worked on a project with Robert and Trevor after Y Combinator. + [0m[1;3;38;2;237;90;200m[documents] A: Paul Graham worked on projects with his colleagues Robert and Trevor before Y Combinator. + [0m[1;3;38;2;90;149;237m[documents] A: Paul Graham worked on writing essays and working on Y Combinator during his time at Y Combinator. + [0m + + + Question: What did Paul Graham work on after Y Combinator? + Response: Paul Graham worked on a project with Robert and Trevor after Y Combinator. + + Context Relevance Score: 0.0 Factual Accuracy Score: 1.0 Response Completeness Score: 0.5 - Question: What did Paul Graham work on after YC? - Context Relevance Score: 0.5 + Question: What did Paul Graham work on before Y Combinator? + Response: Paul Graham worked on projects with his colleagues Robert and Trevor before Y Combinator. + + Context Relevance Score: 0.0 Factual Accuracy Score: 1.0 Response Completeness Score: 0.5 - Question: What did Paul Graham work on before YC? - Context Relevance Score: 1.0 - Factual Accuracy Score: 1.0 - Response Completeness Score: 0.0 + Question: What did Paul Graham work on during Y Combinator? + Response: Paul Graham worked on writing essays and working on Y Combinator during his time at Y Combinator. + + Context Relevance Score: 0.0 + Factual Accuracy Score: 0.5 + Response Completeness Score: 0.5 Question: How was Paul Grahams life different before, during, and after YC? @@ -280,7 +339,7 @@ Here's an example of the dashboard visualizing the scores of the sub-questions i # 3. Re-ranking -Re-ranking is the process of reordering the nodes based on their relevance to the query. There are multiple classes of re-ranking algorithms offered by Llamaindex. We have used CohereRerank for this example. +Re-ranking is the process of reordering the nodes based on their relevance to the query. There are multiple classes of re-ranking algorithms offered by LlamaIndex. We have used SentenceTransformerRerank for this example. The re-ranker allows you to enter the number of top n nodes that will be returned after re-ranking. If this value remains the same as the original number of nodes, the re-ranker will only re-rank the nodes and not change the number of nodes. Otherwise, it will re-rank the nodes and return the top n nodes. @@ -290,22 +349,28 @@ We will perform different evaluations based on the number of nodes returned afte If the number of nodes returned after re-ranking is the same as the original number of nodes, the following evaluation will be performed: -- **Context Reranking**: Check if the order of the re-ranked nodes is more relevant to the query than the original order. +- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks if the order of re-ranked nodes is more relevant to the query than the original order. ```python -api_key = "**********************************" # Insert cohere API key here -cohere_rerank = CohereRerank( - api_key=api_key, top_n=5 -) # In this example, the number of nodes before re-ranking is 5 and after re-ranking is also 5. 
+callback_handler = UpTrainCallbackHandler( + key_type="openai", + api_key=os.environ["OPENAI_API_KEY"], + project_name_prefix="llama", +) +Settings.callback_manager = CallbackManager([callback_handler]) + +rerank_postprocessor = SentenceTransformerRerank( + top_n=3, # number of nodes after reranking + keep_retrieval_score=True, +) index = VectorStoreIndex.from_documents( - documents=documents, service_context=service_context + documents=documents, ) query_engine = index.as_query_engine( - similarity_top_k=10, - node_postprocessors=[cohere_rerank], - service_context=service_context, + similarity_top_k=3, # number of nodes before reranking + node_postprocessors=[rerank_postprocessor], ) response = query_engine.query( @@ -316,25 +381,39 @@ response = query_engine.query( Question: What did Sam Altman do in this essay? Context Reranking Score: 0.0 + + Question: What did Sam Altman do in this essay? + Response: Sam Altman was asked to become the president of Y Combinator after the original founders decided to step back and reorganize the company for long-term sustainability. + + Context Relevance Score: 1.0 + Factual Accuracy Score: 1.0 + Response Completeness Score: 0.5 + # 3b. Re-ranking (With different number of nodes) If the number of nodes returned after re-ranking is the lesser as the original number of nodes, the following evaluation will be performed: -- **Context Conciseness**: If the re-ranked nodes are able to provide all the information required by the query. +- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Examines whether the reduced number of nodes still provides all the required information. ```python -api_key = "**********************************" # insert cohere API key here -cohere_rerank = CohereRerank( - api_key=api_key, top_n=2 -) # In this example, the number of nodes before re-ranking is 5 and after re-ranking is 2. +callback_handler = UpTrainCallbackHandler( + key_type="openai", + api_key=os.environ["OPENAI_API_KEY"], + project_name_prefix="llama", +) +Settings.callback_manager = CallbackManager([callback_handler]) + +rerank_postprocessor = SentenceTransformerRerank( + top_n=2, # Number of nodes after re-ranking + keep_retrieval_score=True, +) index = VectorStoreIndex.from_documents( - documents=documents, service_context=service_context + documents=documents, ) query_engine = index.as_query_engine( - similarity_top_k=10, - node_postprocessors=[cohere_rerank], - service_context=service_context, + similarity_top_k=5, # Number of nodes before re-ranking + node_postprocessors=[rerank_postprocessor], ) # Use your advanced RAG @@ -344,18 +423,20 @@ response = query_engine.query( ``` Question: What did Sam Altman do in this essay? - Context Conciseness Score: 1.0 + Context Conciseness Score: 0.0 -# UpTrain's Managed Service Dashboard and Insights -The UpTrain Managed Service offers the following features: + Question: What did Sam Altman do in this essay? + Response: Sam Altman offered unsolicited advice to the author during a visit to California for interviews. + -1. Advanced dashboards with drill-down and filtering options. -1. Identification of insights and common themes among unsuccessful cases. -1. Real-time observability and monitoring of production data. -1. Integration with CI/CD pipelines for seamless regression testing. 
+ Context Relevance Score: 1.0 + Factual Accuracy Score: 1.0 + Response Completeness Score: 0.5 + +# UpTrain's Managed Service Dashboard and Insights -To define the UpTrain callback handler, the only change required is to set the `key_type` and `api_key` parameters. The rest of the code remains the same. +To use the UpTrain's managed service via the UpTrain callback handler, the only change required is to set the `key_type` and `api_key` parameters. The rest of the code remains the same. ```python callback_handler = UpTrainCallbackHandler( diff --git a/docs/examples/callbacks/UpTrainCallback.ipynb b/docs/examples/callbacks/UpTrainCallback.ipynb index 7ccc72565c127e6310f49d59ce41f9c0780474f1..d4fb0947a7a3122ac4bbf9f41bc58319b6285aed 100644 --- a/docs/examples/callbacks/UpTrainCallback.ipynb +++ b/docs/examples/callbacks/UpTrainCallback.ipynb @@ -13,30 +13,30 @@ "source": [ "# UpTrain Callback Handler\n", "\n", - "This notebook showcases the UpTrain callback handler seamlessly integrating into your pipeline, facilitating diverse evaluations. Three additional evaluations for Llamaindex have been introduced, complementing existing ones. These evaluations run automatically, with results displayed in the output. More details on UpTrain's evaluations can be found [here](https://github.com/uptrain-ai/uptrain?tab=readme-ov-file#pre-built-evaluations-we-offer-). \n", + "UpTrain ([github](https://github.com/uptrain-ai/uptrain) || [website](https://github.com/uptrain-ai/uptrain/) || [docs](https://docs.uptrain.ai/)) is an open-source platform to evaluate and improve GenAI applications. It provides grades for 20+ preconfigured checks (covering language, code, embedding use cases), performs root cause analysis on failure cases and gives insights on how to resolve them. \n", "\n", - "Selected operators from the LlamaIndex pipeline are highlighted for demonstration:\n", + "This notebook showcases how to use UpTrain Callback Handler to evaluate different components of your RAG pipelines.\n", "\n", "## 1. **RAG Query Engine Evaluations**:\n", "The RAG query engine plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we conduct the following evaluations:\n", "\n", - "- **Context Relevance**: Determines if the context extracted from the query is relevant to the response.\n", - "- **Factual Accuracy**: Assesses if the LLM is hallcuinating or providing incorrect information.\n", - "- **Response Completeness**: Checks if the response contains all the information requested by the query.\n", + "- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not.\n", + "- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context.\n", + "- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively.\n", "\n", "## 2. **Sub-Question Query Generation Evaluation**:\n", - "The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using a RAG query engine. 
Given the complexity, we include the previous evaluations and add:\n", + "The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using an RAG query engine. To measure it's accuracy, we use:\n", "\n", - "- **Sub Query Completeness**: Assures that the sub-questions accurately and comprehensively cover the original query.\n", + "- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Assures that the sub-questions accurately and comprehensively cover the original query.\n", "\n", "## 3. **Re-Ranking Evaluations**:\n", - "Re-ranking involves reordering nodes based on relevance to the query and chosing top n nodes. Different evaluations are performed based on the number of nodes returned after re-ranking.\n", + "Re-ranking involves reordering nodes based on relevance to the query and choosing the top nodes. Different evaluations are performed based on the number of nodes returned after re-ranking.\n", "\n", "a. Same Number of Nodes\n", - "- **Context Reranking**: Checks if the order of re-ranked nodes is more relevant to the query than the original order.\n", + "- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks if the order of re-ranked nodes is more relevant to the query than the original order.\n", "\n", "b. Different Number of Nodes:\n", - "- **Context Conciseness**: Examines whether the reduced number of nodes still provides all the required information.\n", + "- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Examines whether the reduced number of nodes still provides all the required information.\n", "\n", "These evaluations collectively ensure the robustness and effectiveness of the RAG query engine, SubQuestionQueryGeneration operator, and the re-ranking process in the LlamaIndex pipeline." ] @@ -47,7 +47,7 @@ "source": [ "#### **Note:** \n", "- We have performed evaluations using basic RAG query engine, the same evaluations can be performed using the advanced RAG query engine as well.\n", - "- Same is true for Re-Ranking evaluations, we have performed evaluations using CohereRerank, the same evaluations can be performed using other re-rankers as well." + "- Same is true for Re-Ranking evaluations, we have performed evaluations using SentenceTransformerRerank, the same evaluations can be performed using other re-rankers as well." 
] }, { @@ -65,10 +65,9 @@ "metadata": {}, "outputs": [], "source": [ - "%pip install llama-index-postprocessor-cohere-rerank\n", "%pip install llama-index-readers-web\n", - "%pip install llama-index-callback-uptrain\n", - "%pip install -q html2text llama-index pandas tqdm uptrain cohere" + "%pip install llama-index-callbacks-uptrain\n", + "%pip install -q html2text llama-index pandas tqdm uptrain torch sentence-transformers" ] }, { @@ -84,15 +83,14 @@ "metadata": {}, "outputs": [], "source": [ - "from llama_index.core.settings import Settings\n", - "from llama_index.core import VectorStoreIndex\n", + "from llama_index.core import Settings, VectorStoreIndex\n", "from llama_index.core.node_parser import SentenceSplitter\n", "from llama_index.readers.web import SimpleWebPageReader\n", "from llama_index.core.callbacks import CallbackManager\n", "from llama_index.callbacks.uptrain.base import UpTrainCallbackHandler\n", "from llama_index.core.query_engine import SubQuestionQueryEngine\n", "from llama_index.core.tools import QueryEngineTool, ToolMetadata\n", - "from llama_index.core.postprocessor.llm_rerank import LLMRerank\n", + "from llama_index.core.postprocessor import SentenceTransformerRerank\n", "from llama_index.llms.openai import OpenAI\n", "\n", "import os" @@ -141,11 +139,16 @@ "metadata": {}, "outputs": [], "source": [ + "os.environ[\n", + " \"OPENAI_API_KEY\"\n", + "] = \"sk-************\" # Replace with your OpenAI API key\n", + "\n", "callback_handler = UpTrainCallbackHandler(\n", " key_type=\"openai\",\n", - " api_key=\"sk-...\", # replace with your OpenAI API key\n", + " api_key=os.environ[\"OPENAI_API_KEY\"],\n", " project_name_prefix=\"llama\",\n", ")\n", + "\n", "Settings.callback_manager = CallbackManager([callback_handler])" ] }, @@ -200,9 +203,9 @@ "metadata": {}, "source": [ "UpTrain callback handler will automatically capture the query, context and response once generated and will run the following three evaluations *(Graded from 0 to 1)* on the response:\n", - "- **Context Relevance**: Check if the context extractedfrom the query is relevant to the response.\n", - "- **Factual Accuracy**: Check how factually accurate the response is.\n", - "- **Response Completeness**: Check if the response contains all the information that the query is asking for." + "- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not.\n", + "- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context.\n", + "- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively." 
] }, { @@ -214,7 +217,10 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-14 16:04:09.869\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:01<00:00, 1.33s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.36s/it]\n", + "100%|██████████| 1/1 [00:03<00:00, 3.50s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.32s/it]\n" ] }, { @@ -223,8 +229,9 @@ "text": [ "\n", "Question: What did Paul Graham do growing up?\n", - "Response: Growing up, Paul Graham worked on writing and programming. He wrote short stories and also tried his hand at programming on the IBM 1401 computer that his school district had. He later got a microcomputer, a TRS-80, and started programming more extensively, creating simple games and even a word processor.\n", - "Context Relevance Score: 0.5\n", + "Response: Growing up, Paul Graham worked on writing short stories and programming. He started programming on an IBM 1401 in 9th grade using an early version of Fortran. Later, he got a TRS-80 computer and wrote simple games, a rocket prediction program, and a word processor. Despite his interest in programming, he initially planned to study philosophy in college before eventually switching to AI.\n", + "\n", + "Context Relevance Score: 0.0\n", "Factual Accuracy Score: 1.0\n", "Response Completeness Score: 1.0\n", "\n" @@ -234,7 +241,10 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-14 16:04:36.895\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:01<00:00, 1.59s/it]\n", + "100%|██████████| 1/1 [00:00<00:00, 1.01it/s]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.76s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.28s/it]\n" ] }, { @@ -243,10 +253,11 @@ "text": [ "\n", "Question: When and how did Paul Graham's mother die?\n", - "Response: The context information does not provide any information about Paul Graham's mother or her death.\n", + "Response: Paul Graham's mother died when he was 18 years old, from a brain tumor.\n", + "\n", "Context Relevance Score: 0.0\n", "Factual Accuracy Score: 0.0\n", - "Response Completeness Score: 0.0\n", + "Response Completeness Score: 0.5\n", "\n" ] }, @@ -254,7 +265,10 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-14 16:04:55.245\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:01<00:00, 1.75s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.55s/it]\n", + "100%|██████████| 1/1 [00:03<00:00, 3.39s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.48s/it]\n" ] }, { @@ -263,10 +277,11 @@ "text": [ "\n", "Question: What, in Paul Graham's opinion, is the most distinctive thing about YC?\n", - "Response: The most distinctive thing about YC, according to Paul Graham's opinion, is that it provides a sense of community and support for startup founders. It solves the problem of isolation that founders often face by connecting them with colleagues who understand the challenges they are going through and can offer guidance and support. 
Additionally, YC fosters a tight-knit alumni community where startups can help each other and even become each other's customers.\n", - "Context Relevance Score: 0.0\n", - "Factual Accuracy Score: 1.0\n", - "Response Completeness Score: 0.5\n", + "Response: The most distinctive thing about Y Combinator, according to Paul Graham, is that instead of deciding for himself what to work on, the problems come to him. Every 6 months, a new batch of startups brings their problems, which then become the focus of YC. This engagement with a variety of startup problems and the direct involvement in solving them is what Graham finds most unique about Y Combinator.\n", + "\n", + "Context Relevance Score: 1.0\n", + "Factual Accuracy Score: 0.3333333333333333\n", + "Response Completeness Score: 1.0\n", "\n" ] }, @@ -274,7 +289,10 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-14 16:05:24.705\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:01<00:00, 1.92s/it]\n", + "100%|██████████| 1/1 [00:00<00:00, 1.20it/s]\n", + "100%|██████████| 1/1 [00:02<00:00, 2.15s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.08s/it]\n" ] }, { @@ -283,9 +301,10 @@ "text": [ "\n", "Question: When and how did Paul Graham meet Jessica Livingston?\n", - "Response: Paul Graham met Jessica Livingston at a party at his house in October 2003. They were introduced to each other by a mutual friend named Maria Daniels. A couple of days later, Paul asked Jessica out and they started dating.\n", - "Context Relevance Score: 0.5\n", - "Factual Accuracy Score: 1.0\n", + "Response: Paul Graham met Jessica Livingston at a big party at his house in October 2003.\n", + "\n", + "Context Relevance Score: 1.0\n", + "Factual Accuracy Score: 0.5\n", "Response Completeness Score: 1.0\n", "\n" ] @@ -294,7 +313,10 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-14 16:05:52.062\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:01<00:00, 1.82s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.14s/it]\n", + "100%|██████████| 1/1 [00:03<00:00, 3.19s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.50s/it]" ] }, { @@ -303,10 +325,18 @@ "text": [ "\n", "Question: What is Bel, and when and where was it written?\n", - "Response: Bel is a new Lisp that was written in Arc. It was written over a period of 4 years, from March 26, 2015, to October 12, 2019. The majority of Bel was written in England, as the author moved there in the summer of 2016.\n", + "Response: Bel is a new Lisp that was written in Arc. It was developed over a period of 4 years, from March 26, 2015 to October 12, 2019. The majority of Bel was written in England.\n", + "\n", "Context Relevance Score: 1.0\n", "Factual Accuracy Score: 1.0\n", - "Response Completeness Score: 0.5\n", + "Response Completeness Score: 1.0\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ "\n" ] } @@ -348,15 +378,15 @@ "source": [ "# 2. Sub-Question Query Engine Evaluation\n", "\n", - "The **sub question query engine** is used to tackle the problem of answering a complex query using multiple data sources. 
It first breaks down the complex query into sub questions for each relevant data source, then gather all the intermediate reponses and synthesizes a final response.\n", + "The **sub-question query engine** is used to tackle the problem of answering a complex query using multiple data sources. It first breaks down the complex query into sub-questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response.\n", "\n", "UpTrain callback handler will automatically capture the sub-question and the responses for each of them once generated and will run the following three evaluations *(Graded from 0 to 1)* on the response:\n", - "- **Context Relevance**: Check if the context extractedfrom the query is relevant to the response.\n", - "- **Factual Accuracy**: Check how factually accurate the response is.\n", - "- **Response Completeness**: Check if the response contains all the information that the query is asking for.\n", + "- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context has sufficient information to answer the user query or not.\n", + "- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM's response can be verified via the retrieved context.\n", + "- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information required to answer the user query comprehensively.\n", "\n", "In addition to the above evaluations, the callback handler will also run the following evaluation:\n", - "- **Sub Query Completeness**: Checks if the sub-questions accurately and completely cover the original query." + "- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Assures that the sub-questions accurately and comprehensively cover the original query." ] }, { @@ -372,9 +402,9 @@ "\u001b[1;3;38;2;237;90;200m[documents] Q: What did Paul Graham work on before YC?\n", "\u001b[0m\u001b[1;3;38;2;90;149;237m[documents] Q: What did Paul Graham work on during YC?\n", "\u001b[0m\u001b[1;3;38;2;11;159;203m[documents] Q: What did Paul Graham work on after YC?\n", - "\u001b[0m\u001b[1;3;38;2;237;90;200m[documents] A: Before Y Combinator (YC), Paul Graham worked on a startup called Viaweb.\n", - "\u001b[0m\u001b[1;3;38;2;11;159;203m[documents] A: After leaving Y Combinator, Paul Graham focused on painting. He wanted to see how good he could get at painting if he dedicated his time and effort to it. He spent most of 2014 working on his painting skills, but eventually ran out of steam in November.\n", - "\u001b[0m\u001b[1;3;38;2;90;149;237m[documents] A: During his time at Y Combinator (YC), Paul Graham worked on various projects. He initially intended to work on three things: hacking, writing essays, and working on YC. However, as YC grew and he became more excited about it, it started to take up a lot more of his attention. 
He also worked on writing essays and was responsible for writing all of YC's internal software in Arc.\n", + "\u001b[0m\u001b[1;3;38;2;11;159;203m[documents] A: After Y Combinator, Paul Graham decided to focus on painting as his next endeavor.\n", + "\u001b[0m\u001b[1;3;38;2;90;149;237m[documents] A: Paul Graham worked on writing essays and working on Y Combinator during YC.\n", + "\u001b[0m\u001b[1;3;38;2;237;90;200m[documents] A: Before Y Combinator, Paul Graham worked on projects with his colleagues Robert and Trevor.\n", "\u001b[0m" ] }, @@ -382,40 +412,65 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-14 08:24:08.958\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n", - "\u001b[32m2024-02-14 08:24:34.450\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 3/3 [00:02<00:00, 1.47it/s]\n", + "100%|██████████| 3/3 [00:00<00:00, 3.28it/s]\n", + "100%|██████████| 3/3 [00:01<00:00, 1.68it/s]\n", + "100%|██████████| 3/3 [00:01<00:00, 2.28it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "\n", - "Question: What did Paul Graham work on before YC?\n", - "Response: Before Y Combinator (YC), Paul Graham worked on a startup called Viaweb.\n", - "Context Relevance Score: 0.0\n", - "Factual Accuracy Score: 1.0\n", - "Response Completeness Score: 0.5\n", - "\n", "\n", "Question: What did Paul Graham work on after YC?\n", - "Response: After leaving Y Combinator, Paul Graham focused on painting. He wanted to see how good he could get at painting if he dedicated his time and effort to it. He spent most of 2014 working on his painting skills, but eventually ran out of steam in November.\n", - "Context Relevance Score: 1.0\n", + "Response: After Y Combinator, Paul Graham decided to focus on painting as his next endeavor.\n", + "\n", + "Context Relevance Score: 0.0\n", "Factual Accuracy Score: 0.0\n", - "Response Completeness Score: 0.0\n", + "Response Completeness Score: 0.5\n", "\n", "\n", "Question: What did Paul Graham work on during YC?\n", - "Response: During his time at Y Combinator (YC), Paul Graham worked on various projects. He initially intended to work on three things: hacking, writing essays, and working on YC. However, as YC grew and he became more excited about it, it started to take up a lot more of his attention. 
He also worked on writing essays and was responsible for writing all of YC's internal software in Arc.\n", - "Context Relevance Score: 0.5\n", + "Response: Paul Graham worked on writing essays and working on Y Combinator during YC.\n", + "\n", + "Context Relevance Score: 0.0\n", "Factual Accuracy Score: 1.0\n", "Response Completeness Score: 0.5\n", "\n", + "\n", + "Question: What did Paul Graham work on before YC?\n", + "Response: Before Y Combinator, Paul Graham worked on projects with his colleagues Robert and Trevor.\n", + "\n", + "Context Relevance Score: 0.0\n", + "Factual Accuracy Score: 0.0\n", + "Response Completeness Score: 0.5\n", + "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1/1 [00:01<00:00, 1.24s/it]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question: How was Paul Grahams life different before, during, and after YC?\n", "Sub Query Completeness Score: 1.0\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ @@ -465,7 +520,7 @@ "source": [ "# 3. Re-ranking \n", "\n", - "Re-ranking is the process of reordering the nodes based on their relevance to the query. There are multiple classes of re-ranking algorithms offered by Llamaindex. We have used CohereRerank for this example.\n", + "Re-ranking is the process of reordering the nodes based on their relevance to the query. There are multiple classes of re-ranking algorithms offered by LlamaIndex. We have used SentenceTransformerRerank for this example.\n", "\n", "The re-ranker allows you to enter the number of top n nodes that will be returned after re-ranking. If this value remains the same as the original number of nodes, the re-ranker will only re-rank the nodes and not change the number of nodes. Otherwise, it will re-rank the nodes and return the top n nodes.\n", "\n", @@ -479,7 +534,8 @@ "## 3a. Re-ranking (With same number of nodes)\n", "\n", "If the number of nodes returned after re-ranking is the same as the original number of nodes, the following evaluation will be performed:\n", - "- **Context Reranking**: Check if the order of the re-ranked nodes is more relevant to the query than the original order." + "\n", + "- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks if the order of re-ranked nodes is more relevant to the query than the original order." 
] }, { @@ -491,7 +547,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-13 20:00:17.459\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:01<00:00, 1.89s/it]\n" ] }, { @@ -500,28 +556,62 @@ "text": [ "\n", "Question: What did Sam Altman do in this essay?\n", + "Context Reranking Score: 1.0\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:01<00:00, 1.88s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.44s/it]\n", + "100%|██████████| 1/1 [00:02<00:00, 2.77s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.45s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Question: What did Sam Altman do in this essay?\n", + "Response: Sam Altman was asked to become the president of Y Combinator after the original founders decided to step down and reorganize the company for long-term sustainability.\n", + "\n", "Context Relevance Score: 1.0\n", "Factual Accuracy Score: 1.0\n", - "Response Completeness Score: 1.0\n", + "Response Completeness Score: 0.5\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ "\n" ] } ], "source": [ - "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\" # Replace with your OpenAI API key\n", - "llm = OpenAI(model=\"gpt-4-turbo-preview\")\n", + "callback_handler = UpTrainCallbackHandler(\n", + " key_type=\"openai\",\n", + " api_key=os.environ[\"OPENAI_API_KEY\"],\n", + " project_name_prefix=\"llama\",\n", + ")\n", + "Settings.callback_manager = CallbackManager([callback_handler])\n", "\n", - "cohere_rerank = LLMRerank(\n", - " llm=llm, top_n=5\n", - ") # In this example, the number of nodes before re-ranking is 5 and after re-ranking is also 5.\n", + "rerank_postprocessor = SentenceTransformerRerank(\n", + " top_n=3, # number of nodes after reranking\n", + " keep_retrieval_score=True,\n", + ")\n", "\n", "index = VectorStoreIndex.from_documents(\n", " documents=documents,\n", ")\n", "\n", "query_engine = index.as_query_engine(\n", - " similarity_top_k=10,\n", - " node_postprocessors=[cohere_rerank],\n", + " similarity_top_k=3, # number of nodes before reranking\n", + " node_postprocessors=[rerank_postprocessor],\n", ")\n", "\n", "response = query_engine.query(\n", @@ -536,7 +626,8 @@ "# 3b. Re-ranking (With different number of nodes)\n", "\n", "If the number of nodes returned after re-ranking is the lesser as the original number of nodes, the following evaluation will be performed:\n", - "- **Context Conciseness**: If the re-ranked nodes are able to provide all the information required by the query." + "\n", + "- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Examines whether the reduced number of nodes still provides all the required information." 
] }, { @@ -548,7 +639,27 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[32m2024-02-13 20:01:39.343\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m110\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n" + "100%|██████████| 1/1 [00:02<00:00, 2.22s/it]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Question: What did Sam Altman do in this essay?\n", + "Context Conciseness Score: 0.0\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:01<00:00, 1.58s/it]\n", + "100%|██████████| 1/1 [00:00<00:00, 1.19it/s]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.62s/it]\n", + "100%|██████████| 1/1 [00:01<00:00, 1.42s/it]" ] }, { @@ -557,27 +668,41 @@ "text": [ "\n", "Question: What did Sam Altman do in this essay?\n", - "Context Relevance Score: 0.5\n", + "Response: Sam Altman offered unsolicited advice to the author during a visit to California for interviews.\n", + "\n", + "Context Relevance Score: 0.0\n", "Factual Accuracy Score: 1.0\n", - "Response Completeness Score: 1.0\n", + "Response Completeness Score: 0.5\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ "\n" ] } ], "source": [ - "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\" # Replace with your OpenAI API key\n", - "llm = OpenAI(model=\"gpt-4-turbo-preview\")\n", + "callback_handler = UpTrainCallbackHandler(\n", + " key_type=\"openai\",\n", + " api_key=os.environ[\"OPENAI_API_KEY\"],\n", + " project_name_prefix=\"llama\",\n", + ")\n", + "Settings.callback_manager = CallbackManager([callback_handler])\n", "\n", - "cohere_rerank = LLMRerank(\n", - " llm=llm, top_n=2\n", - ") # In this example, the number of nodes before re-ranking is 5 and after re-ranking is 2.\n", + "rerank_postprocessor = SentenceTransformerRerank(\n", + " top_n=2, # Number of nodes after re-ranking\n", + " keep_retrieval_score=True,\n", + ")\n", "\n", "index = VectorStoreIndex.from_documents(\n", " documents=documents,\n", ")\n", "query_engine = index.as_query_engine(\n", - " similarity_top_k=10,\n", - " node_postprocessors=[cohere_rerank],\n", + " similarity_top_k=5, # Number of nodes before re-ranking\n", + " node_postprocessors=[rerank_postprocessor],\n", ")\n", "\n", "# Use your advanced RAG\n", @@ -592,14 +717,7 @@ "source": [ "# UpTrain's Managed Service Dashboard and Insights\n", "\n", - "The UpTrain Managed Service offers the following features:\n", - "\n", - "1. Advanced dashboards with drill-down and filtering options.\n", - "1. Identification of insights and common themes among unsuccessful cases.\n", - "1. Real-time observability and monitoring of production data.\n", - "1. Integration with CI/CD pipelines for seamless regression testing.\n", - "\n", - "To define the UpTrain callback handler, the only change required is to set the `key_type` and `api_key` parameters. The rest of the code remains the same.\n", + "To use the UpTrain's managed service via the UpTrain callback handler, the only change required is to set the `key_type` and `api_key` parameters. 
"\n",
"```python\n",
"callback_handler = UpTrainCallbackHandler(\n",
@@ -622,7 +740,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "phoenixdev",
+ "display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -639,5 +757,5 @@
}
},
"nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
}
diff --git a/docs/examples/evaluation/UpTrain.ipynb b/docs/examples/evaluation/UpTrain.ipynb
index 6cb85c2340c9ed385d516009368084ba1b250541..eacb48b43d2f8322b6e67a510639f9cf686cb0f4 100644
--- a/docs/examples/evaluation/UpTrain.ipynb
+++ b/docs/examples/evaluation/UpTrain.ipynb
@@ -21,7 +21,7 @@
"id": "0958c248",
"metadata": {},
"source": [
- "**Overview**: In this example, we will see how to use UpTrain with LlamaIndex. "
+ "**Overview**: In this example, we will see how to use UpTrain with LlamaIndex. UpTrain ([github](https://github.com/uptrain-ai/uptrain) || [website](https://github.com/uptrain-ai/uptrain/) || [docs](https://docs.uptrain.ai/)) is an open-source platform to evaluate and improve GenAI applications. It provides grades for 20+ preconfigured checks (covering language, code, embedding use cases), performs root cause analysis on failure cases and gives insights on how to resolve them. More details on UpTrain's evaluations can be found [here](https://github.com/uptrain-ai/uptrain?tab=readme-ov-file#pre-built-evaluations-we-offer-).\n"
]
},
{
diff --git a/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/README.md b/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/README.md
index f9c12773deaa9e6a48da42655096093797387e59..e2f48e4c2338ec2b3ff3cd30f602dac14f4ee794 100644
--- a/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/README.md
+++ b/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/README.md
@@ -1,8 +1,8 @@
# LlamaIndex Callbacks Integration: UpTrain

-UpTrain is an open-source tool to evaluate and monitor the performance of language models. It provides a set of pre-built evaluations to assess the quality of responses generated by the model. Once you add UpTrainCallbackHandler to your existing LlamaIndex pipeline, it will take care of sending the generated responses to the UpTrain Managed Service for evaluations and display the results in the output.
+UpTrain ([github](https://github.com/uptrain-ai/uptrain) || [website](https://uptrain.ai/) || [docs](https://docs.uptrain.ai/getting-started/introduction)) is an open-source platform to evaluate and improve Generative AI applications. It provides grades for 20+ preconfigured checks (covering language, code, embedding use cases), performs root cause analysis on failure cases and gives insights on how to resolve them. Once you add UpTrainCallbackHandler to your existing LlamaIndex pipeline, it will automatically capture the right data, run evaluations and display the results in the output.

-Three additional evaluations for Llamaindex have been introduced, complementing existing ones. These evaluations run automatically, with results displayed in the output. More details on UpTrain's evaluations can be found [here](https://github.com/uptrain-ai/uptrain?tab=readme-ov-file#pre-built-evaluations-we-offer-).
+More details on UpTrain's evaluations can be found [here](https://github.com/uptrain-ai/uptrain?tab=readme-ov-file#pre-built-evaluations-we-offer-).
Selected operators from the LlamaIndex pipeline are highlighted for demonstration:
@@ -10,26 +10,26 @@ Selected operators from the LlamaIndex pipeline are highlighted for demonstratio
## 1. **RAG Query Engine Evaluations**:

The RAG query engine plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we conduct the following evaluations:

-- **Context Relevance**: Determines if the context extracted from the query is relevant to the response.
-- **Factual Accuracy**: Assesses if the LLM is hallcuinating or providing incorrect information.
-- **Response Completeness**: Checks if the response contains all the information requested by the query.
+- **[Context Relevance](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-relevance)**: Determines if the retrieved context is relevant to the user query.
+- **[Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy)**: Assesses if the LLM is hallucinating or providing incorrect information.
+- **[Response Completeness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-completeness)**: Checks if the response contains all the information requested by the query.

## 2. **Sub-Question Query Generation Evaluation**:

-The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using a RAG query engine. Given the complexity, we include the previous evaluations and add:
+The SubQuestionQueryGeneration operator decomposes a question into sub-questions, generating responses for each using a RAG query engine. To evaluate the performance of the SubQuery module, we add another check and also run the above three for all the sub-queries:

-- **Sub Query Completeness**: Assures that the sub-questions accurately and comprehensively cover the original query.
+- **[Sub Query Completeness](https://docs.uptrain.ai/predefined-evaluations/query-quality/sub-query-completeness)**: Assures that the sub-questions accurately and comprehensively cover the original query.

## 3. **Re-Ranking Evaluations**:

-Re-ranking involves reordering nodes based on relevance to the query and choosing top n nodes. Different evaluations are performed based on the number of nodes returned after re-ranking.
+Re-ranking involves reordering nodes based on relevance to the query and choosing the top n nodes. Different evaluations are performed based on the number of nodes returned after re-ranking.

a. Same Number of Nodes

-- **Context Reranking**: Checks if the order of re-ranked nodes is more relevant to the query than the original order.
+- **[Context Reranking](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-reranking)**: Checks if the order of re-ranked nodes is more relevant to the query than the original order.

b. Different Number of Nodes:

-- **Context Conciseness**: Examines whether the reduced number of nodes still provides all the required information.
+- **[Context Conciseness](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-conciseness)**: Examines whether the reduced number of nodes still provides all the required information.

These evaluations collectively ensure the robustness and effectiveness of the RAG query engine, SubQuestionQueryGeneration operator, and the re-ranking process in the LlamaIndex pipeline.
diff --git a/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/llama_index/callbacks/uptrain/base.py b/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/llama_index/callbacks/uptrain/base.py
index 69ab6de8e509bc12eea1517f128d3f17c952e8ec..b842c6de18ea5e466ed7b6d83f9a94af5107a347 100644
--- a/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/llama_index/callbacks/uptrain/base.py
+++ b/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/llama_index/callbacks/uptrain/base.py
@@ -118,7 +118,7 @@ class UpTrainCallbackHandler(BaseCallbackHandler):
 if column == "question":
 print(f"\nQuestion: {row[column]}")
 elif column == "response":
- print(f"Response: {row[column]}")
+ print(f"Response: {row[column]}\n")
 elif column.startswith("score"):
 if column in score_name_map:
 print(f"{score_name_map[column]}: {row[column]}")
diff --git a/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/pyproject.toml b/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/pyproject.toml
index 903e9f86fa358ccc638e2a32da72586efbc5e2ae..70b4ebe717f0a68a63593c861440aa31cdc620c9 100644
--- a/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/pyproject.toml
+++ b/llama-index-integrations/callbacks/llama-index-callbacks-uptrain/pyproject.toml
@@ -31,8 +31,8 @@ version = "0.1.1"

 [tool.poetry.dependencies]
 python = ">=3.8.1,<4.0"
-llama-index-core = "0.10.0"
-uptrain = ">=0.5.0"
+llama-index-core = ">=0.10.0"
+uptrain = ">=0.6.5"

 [tool.poetry.group.dev.dependencies]
 ipython = "8.10.0"
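---

For quick reference, below is a minimal, self-contained sketch of the re-ranking plus evaluation flow that the notebook hunks in this diff document. It is illustrative only and not part of the changeset: the `./data` directory, the `SimpleDirectoryReader` loading step, and the exact import paths (written against `llama-index-core >= 0.10` and the `llama-index-callbacks-uptrain` package touched above, with `sentence-transformers` installed for the re-ranker) are assumptions; the handler arguments and re-ranker settings mirror the notebook source in the hunks above.

```python
import os

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.callbacks.uptrain.base import UpTrainCallbackHandler

# key_type="openai" runs the evaluations locally via OpenAI; switching to
# key_type="uptrain" with an UpTrain API key uses the managed service instead,
# as described in the notebook source above.
callback_handler = UpTrainCallbackHandler(
    key_type="openai",
    api_key=os.environ["OPENAI_API_KEY"],
    project_name_prefix="llama",
)
Settings.callback_manager = CallbackManager([callback_handler])

# Assumption: the documents (e.g. the Paul Graham essay used in the notebook)
# live under ./data; any directory of text files works.
documents = SimpleDirectoryReader("./data").load_data()

# Re-ranker settings from the "same number of nodes" cell above.
rerank_postprocessor = SentenceTransformerRerank(
    top_n=3,  # number of nodes after re-ranking
    keep_retrieval_score=True,
)

index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine(
    similarity_top_k=3,  # number of nodes before re-ranking
    node_postprocessors=[rerank_postprocessor],
)

# The callback handler captures the retrieval, re-ranking, and response events
# and prints the relevant evaluation scores alongside the answer.
response = query_engine.query("What did Sam Altman do in this essay?")
print(response)
```

Setting `similarity_top_k` equal to `top_n` (as here) triggers the Context Reranking check, while retrieving more nodes than the re-ranker keeps (e.g. `similarity_top_k=5`, `top_n=2`) triggers the Context Conciseness check instead, matching the two notebook cells in the diff.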