Mini ESG Bench Dataset
CLI Usage
You can download llamadatasets directly using llamaindex-cli, which comes installed with the llama-index Python package:
llamaindex-cli download-llamadataset MiniEsgBenchDataset --download-dir ./data
You can then inspect the files at ./data. When you're ready to load the data into Python, you can use the snippet below:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
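Before loading the dataset through LlamaIndex, it can be handy to peek at the downloaded file using only the standard library. The sketch below is illustrative: it assumes that rag_dataset.json is a JSON object with a top-level "examples" list whose entries carry a "query" field, which is an assumption about the file layout rather than something stated here, and summarize_rag_dataset is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_rag_dataset(path, preview=3):
    """Return (number of examples, first few queries) from a dataset JSON file.

    Assumes a top-level "examples" list of objects with a "query" key
    (an assumed layout); missing keys are tolerated rather than raising.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    examples = data.get("examples", [])
    queries = [ex.get("query", "") for ex in examples[:preview]]
    return len(examples), queries


if __name__ == "__main__":
    dataset_path = Path("./data/rag_dataset.json")
    if dataset_path.exists():
        count, queries = summarize_rag_dataset(dataset_path)
        print(f"{count} examples")
        for q in queries:
            print("-", q)
    else:
        print(f"{dataset_path} not found; run the download command first")
```

If the file turns out to be laid out differently, the helper simply reports zero examples instead of failing, which is a cue to open the file and check its keys by hand.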
Code Usage
You can also download the dataset to a directory, say ./data, directly in Python. From there, you can use the convenient RagEvaluatorPack llamapack to run your own LlamaIndex RAG pipeline with the llamadataset.
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex
# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
    "MiniEsgBenchDataset", "./data"
)
# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset, query_engine=query_engine
)
############################################################################
# NOTE: If you have a lower-tier OpenAI API subscription, such as Usage    #
# Tier 1, you'll need a different batch_size and sleep_time_in_seconds.    #
# For Usage Tier 1, settings that seemed to work well were batch_size=5    #
# and sleep_time_in_seconds=15 (as of December 2023).                      #
############################################################################
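# top-level await works in a notebook; in a script, call this from an
# async function and run that function with asyncio.run()
benchmark_df = await rag_evaluator_pack.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)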
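To build intuition for how batch_size and sleep_time_in_seconds trade speed against rate limits, here is a generic sketch of batched async calls with a pause before each batch. It is not the RagEvaluatorPack implementation; run_in_batches and fake_api_call are hypothetical names used only for illustration, and the "API call" is a stand-in that makes no network request. It also shows asyncio.run, which is one way to drive async code from a plain script.

```python
import asyncio


async def run_in_batches(items, worker, batch_size=20, sleep_time_in_seconds=1):
    """Apply an async worker to items in batches, pausing before each batch."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Pause before each batch so bursts of calls stay under a rate limit.
        await asyncio.sleep(sleep_time_in_seconds)
        # Calls within one batch run concurrently; batches run one after another.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_api_call(query):
    # Stand-in for a real API request; it just upper-cases the query.
    await asyncio.sleep(0)
    return query.upper()


if __name__ == "__main__":
    queries = [f"question {i}" for i in range(7)]
    answers = asyncio.run(
        run_in_batches(queries, fake_api_call, batch_size=3, sleep_time_in_seconds=0.1)
    )
    print(answers)
```

In this sketch, 7 items with batch_size=3 form 3 batches, so the run spends at least 3 × 0.1 seconds sleeping. A smaller batch_size with a longer sleep, as suggested for Usage Tier 1, lowers the request rate at the cost of a longer total run.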
Citing the data
If you use this dataset for research, please cite it using the details below.
@misc{llamaindex_mini_esg_bench_2023,
    title={Mini ESG Bench},
    author={Val Andrei Fajardo},
    year={2023},
    organization={llamaindex}
}