Unverified commit 1052ee2d authored by Dave Mariano, committed by GitHub

Implement EvalQueryEngineTool (#11679)

parent e91c81dc
# ChangeLog
## [2024-03-23]
### `llama-index-core` [0.10.23]
- Added `(a)predict_and_call()` function to base LLM class + openai + mistralai (#12188)
- fixed bug with `wait()` in async agent streaming (#12187)
### `llama-index-embeddings-alephalpha` [0.1.0]
- Added alephalpha embeddings (#12149)
### `llama-index-llms-alephalpha` [0.1.0]
- Added alephalpha LLM (#12149)
### `llama-index-llms-openai` [0.1.7]
- fixed bug with `wait()` in async agent streaming (#12187)
### `llama-index-readers-docugami` [0.1.4]
- fixed import errors in docugami reader (#12154)
### `llama-index-readers-file` [0.1.12]
- fix PDFReader for remote fs (#12186)
## [2024-03-21]
### `llama-index-core` [0.10.22]
......
...@@ -375,22 +375,31 @@ Whether if it's the latest research, or what you thought of in the shower, we'd
We would love your help in making the project cleaner, more robust, and more understandable. If you find something confusing, it most likely is for other people as well. Help us be better!
## Development Guidelines

### Setting up environment

LlamaIndex is a Python package. We've tested primarily with Python versions >= 3.8. Here's a quick and dirty guide to setting up your environment for local development.

1. Fork the [LlamaIndex Github repo][ghr]\* and clone it locally. (New to GitHub / git? Here's [how][frk].)
2. In a terminal, `cd` into the directory of your local clone of your forked repo.
3. Install [pre-commit hooks][pch]\* by running `pre-commit install`. These hooks are small housekeeping scripts executed every time you make a git commit, which automates away a lot of chores.
4. `cd` into the specific package you want to work on. For example, if I want to work on the core package, I execute `cd llama-index-core/`. (New to terminal / command line? Here's a [getting started guide][gsg].)
5. Prepare a [virtual environment][vev].
   1. [Install Poetry][pet]\*. This will help you manage package dependencies.
   2. Execute `poetry shell`. This command creates a [virtual environment][vev] specific to this package, which keeps installed packages contained to this project. (New to Poetry, the dependency & packaging manager for Python? Read about its basic usage [here][bus].)
   3. Execute `poetry install --with dev,docs`\*. This installs all dependencies needed for local development. To see what will be installed, read the `pyproject.toml` under that directory.
[frk]: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo
[ghr]: https://github.com/run-llama/llama_index/
[pch]: https://pre-commit.com/
[gsg]: https://www.freecodecamp.org/news/command-line-for-beginners/
[pet]: https://python-poetry.org/docs/#installation
[vev]: https://python-poetry.org/docs/managing-environments/
[bus]: https://python-poetry.org/docs/basic-usage/
Steps marked with an asterisk (`*`) are one-time tasks. You don't have to repeat them when you attempt to contribute on something else next time.
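For reference, here is the whole setup condensed into shell commands, assuming you want to work on the core package (`<your-username>` is a placeholder for your GitHub account; substitute the package directory you actually intend to touch):

```bash
git clone https://github.com/<your-username>/llama_index.git
cd llama_index
pre-commit install
cd llama-index-core/
poetry shell
poetry install --with dev,docs
```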
Now you should be set!

...@@ -401,41 +410,52 @@ let's also make sure to `test` it and perhaps create an `example notebook`.

#### Formatting/Linting

We run an assortment of linters: `black`, `ruff`, `mypy`.

If you have installed pre-commit hooks in this repo, they should have taken care of the formatting and linting automatically.

If -- for whatever reason -- you would like to do it manually, you can format and lint your changes with the following commands in the root directory:

```bash
make format; make lint
```

Under the hood, we still install pre-commit hooks for you, so that you don't have to do this manually next time.
#### Testing

If you modified or added code logic, **create test(s)**, because they help prevent other maintainers from accidentally breaking the nice things you added / re-introducing the bugs you fixed.

- In almost all cases, add **unit tests**.
- If your change involves adding a new integration, also add **integration tests**. When doing so, please [mock away][mck] the remote system that you're integrating LlamaIndex with, so that when the remote system changes, LlamaIndex developers won't see test failures.

Reciprocally, you should **run existing tests** (from every package that you touched) before making a git commit, so that you can be sure you didn't break someone else's good work.

(By the way, when a test is run with the goal of detecting whether something broke in a new version of the codebase, it's referred to as a "[regression test][reg]". You'll also hear people say "the test _regressed_" as a more diplomatic way of saying "the test _failed_".)
Our tests are stored in the `tests` folders under each package directory. We use the testing framework [pytest][pyt], so you can **just run `pytest` in each package you touched** to run all its tests.
Just like with formatting and linting, if you prefer to do things the [make][mkf] way, run:
```shell
make test
```
Regardless of whether you have run them locally, a [CI system][cis] will run all affected tests on your PR when you submit one anyway. There, tests are orchestrated with [Pants][pts], the build system of our choice. There is a slight chance that tests which broke on CI didn't break on your local machine, or the other way around. When that happens, please take our CI as the source of truth: our release pipeline (which builds the packages users are going to download from PyPI) runs in CI, not on your machine (even if you volunteer), so the CI is the gold standard.
[reg]: https://www.browserstack.com/guide/regression-testing
[mck]: https://pytest-mock.readthedocs.io/en/latest/
[pyt]: https://docs.pytest.org/
[mkf]: https://makefiletutorial.com/
[cis]: https://www.atlassian.com/continuous-delivery/continuous-integration
[pts]: https://www.pantsbuild.org/
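For example, if your change touched `llama-index-core`, a typical local test run looks like the following (paths shown for illustration; adjust to whichever package you modified):

```shell
cd llama-index-core/
poetry install --with dev,docs
pytest            # or, equivalently: make test
```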
### Creating an Example Notebook

For changes that involve entirely new features, it may be worth adding an example Jupyter notebook to showcase this feature.

Example notebooks can be found in [this folder](https://github.com/run-llama/llama_index/tree/main/docs/examples).

### Creating a pull request
......
%% Cell type:markdown id:6b0186a4 tags:
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/tools/eval_query_engine_tool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id:b50c4af8-fec3-4396-860a-1322089d76cb tags:
# Evaluation Query Engine Tool
In this section we will show you how you can use an `EvalQueryEngineTool` with an agent. Some reasons you may want to use an `EvalQueryEngineTool`:
1. Use a specific kind of evaluation for a tool, and not just the agent's reasoning
2. Use a different LLM for evaluating tool responses than the agent LLM
An `EvalQueryEngineTool` is built on top of the `QueryEngineTool`. Along with wrapping an existing [query engine](https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/root.html), it also must be given an existing [evaluator](https://docs.llamaindex.ai/en/stable/examples/evaluation/answer_and_context_relevancy.html) to evaluate the responses of that query engine.
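To give a sense of the shape before we go through the full example, here is a minimal sketch of the construction pattern (the query engine here is a stand-in; the complete, runnable version is built step by step below):

```python
from llama_index.core.evaluation import RelevancyEvaluator
from llama_index.core.tools import ToolMetadata
from llama_index.core.tools.eval_query_engine import EvalQueryEngineTool

# `my_query_engine` is assumed to be an existing query engine
tool = EvalQueryEngineTool(
    evaluator=RelevancyEvaluator(),
    query_engine=my_query_engine,
    metadata=ToolMetadata(name="my_tool", description="Answers questions about my data."),
)
```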
%% Cell type:markdown id:db402a8b-90d6-4e1d-8df6-347c54624f26 tags:
## Install Dependencies
%% Cell type:code id:dd31aff7 tags:
``` python
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai
%pip install llama-index-agent-openai
```
%% Cell type:code id:9f9fcf29 tags:
``` python
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
```
%% Cell type:markdown id:7603dec1 tags:
## Initialize and Set LLM and Local Embedding Model
%% Cell type:code id:05fd9050 tags:
``` python
from llama_index.core.settings import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
Settings.embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5"
)
Settings.llm = OpenAI()
```
%% Cell type:markdown id:6c6bdb82 tags:
## Download and Index Data
This is something we are doing for the sake of this demo. In production environments, data stores and indexes should already exist and not be created on the fly.
%% Cell type:markdown id:64df0568 tags:
### Create Storage Contexts
%% Cell type:code id:91618236-54d3-4783-86b7-7b7554efeed1 tags:
``` python
from llama_index.core import (
StorageContext,
load_index_from_storage,
)
try:
    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/lyft",
    )
    lyft_index = load_index_from_storage(storage_context)

    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/uber"
    )
    uber_index = load_index_from_storage(storage_context)

    index_loaded = True
except:
    index_loaded = False
```
%% Cell type:markdown id:6a79cbc9 tags:
Download Data
%% Cell type:code id:36d80144 tags:
``` python
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'
```
%% Cell type:markdown id:4f801ac5 tags:
### Load Data
%% Cell type:code id:d3d0bb8c-16c8-4946-a9d8-59528cf3952a tags:
``` python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
if not index_loaded:
    # load data
    lyft_docs = SimpleDirectoryReader(
        input_files=["./data/10k/lyft_2021.pdf"]
    ).load_data()
    uber_docs = SimpleDirectoryReader(
        input_files=["./data/10k/uber_2021.pdf"]
    ).load_data()

    # build index
    lyft_index = VectorStoreIndex.from_documents(lyft_docs)
    uber_index = VectorStoreIndex.from_documents(uber_docs)

    # persist index
    lyft_index.storage_context.persist(persist_dir="./storage/lyft")
    uber_index.storage_context.persist(persist_dir="./storage/uber")
```
%% Cell type:markdown id:ccb89178 tags:
## Create Query Engines
%% Cell type:code id:31892898-a2dc-43c8-812a-3442feb2108d tags:
``` python
lyft_engine = lyft_index.as_query_engine(similarity_top_k=5)
uber_engine = uber_index.as_query_engine(similarity_top_k=5)
```
%% Cell type:markdown id:880c2007 tags:
## Create Evaluator
%% Cell type:code id:911235b3 tags:
``` python
from llama_index.core.evaluation import RelevancyEvaluator
evaluator = RelevancyEvaluator()
```
%% Cell type:markdown id:60c542c1 tags:
## Create Query Engine Tools
%% Cell type:code id:f9f3158a-7647-4442-8de1-4db80723b4d2 tags:
``` python
from llama_index.core.tools import ToolMetadata
from llama_index.core.tools.eval_query_engine import EvalQueryEngineTool
query_engine_tools = [
    EvalQueryEngineTool(
        evaluator=evaluator,
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft",
            description=(
                "Provides information about Lyft's financials for year 2021. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ),
    EvalQueryEngineTool(
        evaluator=evaluator,
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber",
            description=(
                "Provides information about Uber's financials for year 2021. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ),
]
```
%% Cell type:markdown id:275c01b1-8dce-4216-9203-1e961b7fc313 tags:
## Setup OpenAI Agent
%% Cell type:code id:ded93297-fee8-4329-bf37-cf77e87621ae tags:
``` python
from llama_index.agent.openai import OpenAIAgent
agent = OpenAIAgent.from_tools(query_engine_tools, verbose=True)
```
%% Cell type:markdown id:699ee1bb tags:
## Query Engine Fails Evaluation
For demonstration purposes, we will tell the agent to choose the wrong tool first so that we can observe the effect of the `EvalQueryEngineTool` when evaluation fails. To achieve this, we will set `tool_choice` to `lyft` when calling the agent.
This is what we should expect to happen:
1. The agent will use the `lyft` tool first, which contains the wrong financials, as we have instructed it to do so
2. The `EvalQueryEngineTool` will evaluate the response of the query engine using its evaluator
3. The query engine output will fail evaluation because it contains Lyft's financials and not Uber's
4. The tool will form a response that informs the agent that the tool could not be used, giving a reason
5. The agent will fall back to the second tool, `uber`
6. The query engine output of the second tool will pass evaluation because it contains Uber's financials
7. The agent will respond with an answer
%% Cell type:code id:70a82471-9226-42ad-bd8a-aebde3530d95 tags:
``` python
response = await agent.achat(
    "What was Uber's revenue growth in 2021?", tool_choice="lyft"
)
print(str(response))
```
%% Output
Added user message to memory: What was Uber's revenue growth in 2021?
=== Calling Function ===
Calling function: lyft with args: {"input":"What was Uber's revenue growth in 2021?"}
Got output: Could not use tool lyft because it failed evaluation.
Reason: NO
========================
=== Calling Function ===
Calling function: uber with args: {"input":"What was Uber's revenue growth in 2021?"}
Got output: Uber's revenue grew by 57% in 2021.
========================
Uber's revenue grew by 57% in 2021.
%% Cell type:markdown id:48eec4e4 tags:
## Query Engine Passes Evaluation
Here we are asking a question about Lyft's financials. This is what we should expect to happen:
1. The agent will use the `lyft` tool first, simply based on its description, as we have **not** set `tool_choice` here
2. The `EvalQueryEngineTool` will evaluate the response of the query engine using its evaluator
3. The output of the query engine will pass evaluation because it contains Lyft's financials
%% Cell type:code id:7b114dd1 tags:
``` python
response = await agent.achat("What was Lyft's revenue growth in 2021?")
print(str(response))
```
%% Output
Added user message to memory: What was Lyft's revenue growth in 2021?
=== Calling Function ===
Calling function: lyft with args: {"input": "What was Lyft's revenue growth in 2021?"}
Got output: Lyft's revenue growth in 2021 was $3,208,323, which increased compared to the revenue in 2020 and 2019.
========================
=== Calling Function ===
Calling function: uber with args: {"input": "What was Lyft's revenue growth in 2021?"}
Got output: Could not use tool uber because it failed evaluation.
Reason: NO
========================
Lyft's revenue grew by $3,208,323 in 2021, which increased compared to the revenue in 2020 and 2019.
...@@ -77,6 +77,7 @@ nav:
- ./examples/agent/openai_agent_query_plan.ipynb
- ./examples/agent/openai_retrieval_benchmark.ipynb
- ./examples/agent/agent_runner/agent_around_query_pipeline_with_HyDE_for_PDFs.ipynb
- ./examples/agent/mistral_agent.ipynb
- Callbacks:
- ./examples/callbacks/HoneyHiveLlamaIndexTracer.ipynb
- ./examples/callbacks/PromptLayerHandler.ipynb
...@@ -427,6 +428,7 @@ nav:
- ./examples/retrievers/videodb_retriever.ipynb
- Tools:
- ./examples/tools/OnDemandLoaderTool.ipynb
- ./examples/tools/eval_query_engine_tool.ipynb
- Transforms:
- ./examples/transforms/TransformsEval.ipynb
- Use Cases:
...@@ -814,6 +816,7 @@ nav:
- ./api_reference/packs/ollama_query_engine.md
- ./api_reference/packs/panel_chatbot.md
- ./api_reference/packs/query_understanding_agent.md
- ./api_reference/packs/raft_dataset.md
- ./api_reference/packs/rag_cli_local.md
- ./api_reference/packs/rag_evaluator.md
- ./api_reference/packs/rag_fusion_query_pipeline.md
...@@ -1134,6 +1137,7 @@ nav:
- ./api_reference/storage/chat_store/simple.md
- Docstore:
- ./api_reference/storage/docstore/dynamodb.md
- ./api_reference/storage/docstore/elasticsearch.md
- ./api_reference/storage/docstore/firestore.md
- ./api_reference/storage/docstore/index.md
- ./api_reference/storage/docstore/mongodb.md
...@@ -1150,6 +1154,7 @@ nav:
- ./api_reference/storage/graph_stores/simple.md
- Index Store:
- ./api_reference/storage/index_store/dynamodb.md
- ./api_reference/storage/index_store/elasticsearch.md
- ./api_reference/storage/index_store/firestore.md
- ./api_reference/storage/index_store/index.md
- ./api_reference/storage/index_store/mongodb.md
...@@ -1740,6 +1745,9 @@ plugins:
- ../llama-index-integrations/graph_stores/llama-index-graph-stores-neptune
- ../llama-index-integrations/embeddings/llama-index-embeddings-alephalpha
- ../llama-index-integrations/llms/llama-index-llms-alephalpha
- ../llama-index-packs/llama-index-packs-raft-dataset
- ../llama-index-integrations/storage/docstore/llama-index-storage-docstore-elasticsearch
- ../llama-index-integrations/storage/index_store/llama-index-storage-index-store-elasticsearch
- redirects:
redirect_maps:
./api/llama_index.vector_stores.MongoDBAtlasVectorSearch.html: api_reference/storage/vector_store/mongodb.md
......
from typing import Any, Optional
from llama_index.core.base.base_query_engine import BaseQueryEngine
from llama_index.core.evaluation import (
AnswerRelevancyEvaluator,
BaseEvaluator,
EvaluationResult,
)
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools.types import ToolMetadata, ToolOutput
DEFAULT_NAME = "query_engine_tool"
DEFAULT_DESCRIPTION = """Useful for running a natural language query
against a knowledge base and get back a natural language response.
"""
FAILED_TOOL_OUTPUT_TEMPLATE = (
"Could not use tool {tool_name} because it failed evaluation.\n" "Reason: {reason}"
)
class EvalQueryEngineTool(QueryEngineTool):
"""Evaluating query engine tool.
A tool that makes use of a query engine and an evaluator, where the
evaluation of the query engine response will determine the tool output.
Args:
evaluator (BaseEvaluator): A query engine.
query_engine (BaseQueryEngine): A query engine.
metadata (ToolMetadata): The associated metadata of the query engine.
"""
    _evaluator: BaseEvaluator
    _failed_tool_output_template: str

    def __init__(
        self,
        evaluator: BaseEvaluator,
        *args,
        failed_tool_output_template: str = FAILED_TOOL_OUTPUT_TEMPLATE,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self._evaluator = evaluator
        self._failed_tool_output_template = failed_tool_output_template

    def _process_tool_output(
        self,
        tool_output: ToolOutput,
        evaluation_result: EvaluationResult,
    ) -> ToolOutput:
        if evaluation_result.passing:
            return tool_output

        tool_output.content = self._failed_tool_output_template.format(
            tool_name=self.metadata.name,
            reason=evaluation_result.feedback,
        )
        return tool_output

    @classmethod
    def from_defaults(
        cls,
        query_engine: BaseQueryEngine,
        evaluator: Optional[BaseEvaluator] = None,
        name: Optional[str] = None,
        description: Optional[str] = None,
        resolve_input_errors: bool = True,
    ) -> "EvalQueryEngineTool":
        return cls(
            evaluator=evaluator or AnswerRelevancyEvaluator(),
            query_engine=query_engine,
            metadata=ToolMetadata(
                name=name or DEFAULT_NAME,
                description=description or DEFAULT_DESCRIPTION,
            ),
            resolve_input_errors=resolve_input_errors,
        )

    def call(self, *args: Any, **kwargs: Any) -> ToolOutput:
        tool_output = super().call(*args, **kwargs)
        evaluation_results = self._evaluator.evaluate_response(
            tool_output.raw_input["input"], tool_output.raw_output
        )
        return self._process_tool_output(tool_output, evaluation_results)

    async def acall(self, *args: Any, **kwargs: Any) -> ToolOutput:
        tool_output = await super().acall(*args, **kwargs)
        evaluation_results = await self._evaluator.aevaluate_response(
            tool_output.raw_input["input"], tool_output.raw_output
        )
        return self._process_tool_output(tool_output, evaluation_results)
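For orientation, here is a minimal usage sketch of the class above, assuming an existing `query_engine` (any `BaseQueryEngine`); when evaluation fails, the tool's `content` is rewritten using `FAILED_TOOL_OUTPUT_TEMPLATE`:

```python
from llama_index.core.evaluation import AnswerRelevancyEvaluator
from llama_index.core.tools.eval_query_engine import EvalQueryEngineTool

# Wrap an existing query engine; the evaluator defaults to
# AnswerRelevancyEvaluator() if omitted in from_defaults().
tool = EvalQueryEngineTool.from_defaults(
    query_engine=query_engine,
    evaluator=AnswerRelevancyEvaluator(),
    name="my_tool",
    description="Answers questions about my data.",
)

output = tool(input="What was revenue growth in 2021?")
print(output.content)  # the answer, or the "failed evaluation" message
```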
...@@ -61,18 +61,7 @@ class QueryEngineTool(AsyncBaseTool):
        return self._metadata

    def call(self, *args: Any, **kwargs: Any) -> ToolOutput:
        query_str = self._get_query_str(*args, **kwargs)
        response = self._query_engine.query(query_str)
        return ToolOutput(
            content=str(response),
...@@ -82,16 +71,7 @@ class QueryEngineTool(AsyncBaseTool):
        )

    async def acall(self, *args: Any, **kwargs: Any) -> ToolOutput:
        query_str = self._get_query_str(*args, **kwargs)
        response = await self._query_engine.aquery(query_str)
        return ToolOutput(
            content=str(response),
...@@ -112,3 +92,17 @@ class QueryEngineTool(AsyncBaseTool):
            description=self.metadata.description,
        )
        return LlamaIndexTool.from_tool_config(tool_config=tool_config)
    def _get_query_str(self, *args, **kwargs) -> str:
        if args is not None and len(args) > 0:
            query_str = str(args[0])
        elif kwargs is not None and "input" in kwargs:
            # NOTE: this assumes our default function schema of `input`
            query_str = kwargs["input"]
        elif kwargs is not None and self._resolve_input_errors:
            query_str = str(kwargs)
        else:
            raise ValueError(
                "Cannot call query engine without specifying `input` parameter."
            )
        return query_str
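To illustrate what the extracted helper accepts, both calling conventions below resolve to the same query string (a sketch; `tool` stands in for any `QueryEngineTool` instance):

```python
# Positional argument: resolved via str(args[0])
tool("What was Lyft's revenue growth in 2021?")

# Keyword argument matching the default function schema: kwargs["input"]
tool(input="What was Lyft's revenue growth in 2021?")

# With resolve_input_errors=True (the default), any other kwargs fall back
# to str(kwargs); otherwise a ValueError is raised.
```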
"""Test EvalQueryEngine tool."""
from typing import Optional, Sequence, Any
from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock
from llama_index.core.evaluation import EvaluationResult
from llama_index.core.evaluation.base import BaseEvaluator
from llama_index.core.prompts.mixin import PromptDictType
from llama_index.core.query_engine.custom import CustomQueryEngine
from llama_index.core.response import Response
from llama_index.core.tools.eval_query_engine import EvalQueryEngineTool
from llama_index.core.tools.types import ToolOutput
class MockEvaluator(BaseEvaluator):
    """Mock Evaluator for testing purposes."""

    def _get_prompts(self) -> PromptDictType:
        ...

    def _update_prompts(self, prompts_dict: PromptDictType) -> None:
        ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        ...


class MockQueryEngine(CustomQueryEngine):
    """Custom query engine."""

    def custom_query(self, query_str: str) -> str:
        """Query."""
        return "custom_" + query_str


class TestEvalQueryEngineTool(IsolatedAsyncioTestCase):
    def setUp(self) -> None:
        self.mock_evaluator = MockEvaluator()
        self.mock_evaluator.aevaluate = AsyncMock()
        self.mock_evaluator.aevaluate.return_value = EvaluationResult(passing=True)

        tool_name = "nice_tool"
        self.tool_input = "hello world"
        self.expected_content = f"custom_{self.tool_input}"
        self.expected_tool_output = ToolOutput(
            content=self.expected_content,
            raw_input={"input": self.tool_input},
            raw_output=Response(
                response=self.expected_content,
                source_nodes=[],
            ),
            tool_name=tool_name,
        )
        self.eval_query_engine_tool = EvalQueryEngineTool.from_defaults(
            MockQueryEngine(), evaluator=self.mock_evaluator, name=tool_name
        )

    def test_eval_query_engine_tool_with_eval_passing(self) -> None:
        """Test eval query engine tool with evaluation passing."""
        tool_output = self.eval_query_engine_tool(self.tool_input)
        self.assertEqual(self.expected_tool_output, tool_output)

    def test_eval_query_engine_tool_with_eval_failing(self) -> None:
        """Test eval query engine tool with evaluation failing."""
        evaluation_feedback = "The context does not provide a relevant answer."
        self.mock_evaluator.aevaluate.return_value = EvaluationResult(
            passing=False, feedback=evaluation_feedback
        )
        self.expected_tool_output.content = (
            "Could not use tool nice_tool because it failed evaluation.\n"
            f"Reason: {evaluation_feedback}"
        )

        tool_output = self.eval_query_engine_tool(self.tool_input)
        self.assertEqual(self.expected_tool_output, tool_output)