
Contributing to LlamaIndex

Interested in contributing to LlamaIndex? Here's how to get started!

Contribution Guideline

The best part of LlamaIndex is our community of users and contributors.

What should I work on?

  1. :new: Extend core modules by contributing an integration
  2. :package: Contribute a Tool, Reader, Pack, or Dataset (formerly from llama-hub)
  3. :brain: Add new capabilities to core
  4. :bug: Fix bugs
  5. :tada: Add usage examples
  6. :test_tube: Add experimental features
  7. :page_facing_up: Improve code quality & documentation

Also, join our Discord for ideas and discussions: https://discord.gg/dGcwcsnxhU.

1. :new: Extend Core Modules

The most impactful way to contribute to LlamaIndex is by extending our core modules.

We welcome contributions across all of these modules. So far, we have implemented a core set of functionalities for each, all of which are encapsulated in the LlamaIndex core package. As a contributor, you can help each module unlock its full potential. Brief descriptions of these modules are provided below, and you can refer to their respective folders within this GitHub repository for example integrations.

Contributing an integration involves submitting the source code for a new Python package. For now, these integrations will live in the LlamaIndex GitHub repository, and the team will be responsible for publishing the package to PyPI. (Having these packages live outside of this repository, maintained by our community members, is under consideration.)

Creating A New Integration Package

Both llama-index and llama-index-core come equipped with a command-line tool that can be used to initialize a new integration package.

cd ./llama-index-integrations/llms
llamaindex-cli new-package --kind "llms" --name "gemini"

Executing the above commands will create a new folder called llama-index-llms-gemini within the llama-index-integrations/llms directory.

Please add a detailed README to your new package, as it will appear on both llamahub.ai and PyPI.org. In addition to preparing your source code and supplying a detailed README, we also ask that you fill in some metadata so that your package appears on llamahub.ai with the correct information. You do so by adding the required metadata under the [tool.llamahub] section in your new package's pyproject.toml.

Below is an example of the metadata required for all of our integration packages. Please replace the default author "llama-index" with your own GitHub username.

[tool.llamahub]
contains_example = false
import_path = "llama_index.llms.anthropic"

[tool.llamahub.class_authors]
Anthropic = "llama-index"

(source)

NOTE: We are making rapid improvements to the project, and as a result, some interfaces are still volatile. Specifically, we are actively working on making the following components more modular and extensible: core indexes, document stores, index queries, and the query runner.

Module Details

Below, we will describe what each module does, give a high-level idea of the interface, show existing implementations, and give some ideas for contribution.


Data Loaders

A data loader ingests data of any format from anywhere into Document objects, which can then be parsed and indexed.

Interface:

  • load_data takes arbitrary arguments as input (e.g. path to data), and outputs a sequence of Document objects.
  • lazy_load_data takes arbitrary arguments as input (e.g. path to data), and outputs an iterable object of Document objects. This is a lazy version of load_data, which is useful for large datasets.

Note: If only lazy_load_data is implemented, load_data will be delegated to it.
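
As an illustration, here is a minimal sketch of a custom reader that implements only lazy_load_data; the reader name and directory layout are hypothetical, and load_data is inherited from BaseReader:

from pathlib import Path
from typing import Iterable

from llama_index.core import Document
from llama_index.core.readers.base import BaseReader

class TxtDirectoryReader(BaseReader):
    """Hypothetical reader: lazily loads every .txt file in a directory."""

    def lazy_load_data(self, dir_path: str) -> Iterable[Document]:
        for path in sorted(Path(dir_path).glob("*.txt")):
            # One Document per file; metadata helps downstream retrieval.
            yield Document(text=path.read_text(), metadata={"file_name": path.name})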

Examples:

Contributing a data loader is easy and super impactful for the community. The preferred way to contribute is by making a PR on the LlamaHub GitHub repo.

Ideas

  • Want to load something but there's no LlamaHub data loader for it yet? Make a PR!

Node Parser

A node parser parses Document objects into Node objects (atomic units of data that LlamaIndex operates over, e.g., chunk of text, image, or table). It is responsible for splitting text (via text splitters) and explicitly modeling the relationship between units of data (e.g. A is the source of B, C is a chunk after D).

Interface: get_nodes_from_documents takes a sequence of Document objects as input, and outputs a sequence of Node objects.
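
For instance, the built-in SentenceSplitter is one node parser; a minimal usage sketch (chunk sizes are arbitrary):

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)  # sizes are arbitrary
nodes = parser.get_nodes_from_documents([Document(text="Some long text ...")])
# Each resulting node records its source document and neighboring chunks as relationships.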

Examples:

See the API reference for full details.

Ideas:

  • Add new Node relationships to model hierarchical documents (e.g. play-act-scene, chapter-section-heading).

Text Splitters

A text splitter splits a long text str into smaller text str chunks, according to a desired chunk size and splitting "strategy". This matters because LLMs have a limited context window size, and the quality of the text chunks used as context impacts the quality of query results.

Interface: split_text takes a str as input, and outputs a sequence of str.
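
A quick usage sketch with the built-in TokenTextSplitter (parameter values are arbitrary):

from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = splitter.split_text("A very long text ...")  # -> list[str]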

Examples:


Document/Index/KV Stores

Under the hood, LlamaIndex also supports a swappable storage layer that allows you to customize Document Stores (where ingested documents, i.e., Node objects, are stored) and Index Stores (where index metadata is stored).

We have an underlying key-value abstraction backing the document/index stores. Currently we support in-memory and MongoDB storage for these stores. Open to contributions!
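
To illustrate the swappable layer, here is one way to assemble a StorageContext with explicit in-memory stores; a newly contributed backend would slot in the same way:

from llama_index.core import StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore

storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),   # swap in e.g. a MongoDB-backed store here
    index_store=SimpleIndexStore(),
)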

See Storage guide for details.


Managed Index

A managed index is used to represent an index that's managed via an API, exposing API calls to index documents and query documents.

Currently we support the VectaraIndex. Open to contributions!

See Managed Index docs for details.


Vector Stores

Our vector store classes store embeddings and support lookup via similarity search. These serve as the main data store and retrieval engine for our vector index.

Interface:

  • add takes in a sequence of NodeWithEmbeddings and inserts the embeddings (and possibly the node contents & metadata) into the vector store.
  • delete removes entries given document IDs.
  • query retrieves top-k most similar entries given a query embedding.
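
As a rough illustration of this contract, here is a toy in-memory store (deliberately not LlamaIndex's actual base class) implementing the three operations, with query using cosine similarity:

import numpy as np

class ToyVectorStore:
    """Toy sketch of the add/delete/query contract (not the real base class)."""

    def __init__(self):
        self._embeddings = {}  # node_id -> embedding vector
        self._doc_ids = {}     # node_id -> source document id

    def add(self, nodes):
        # `nodes` are assumed to carry node_id, embedding, and ref_doc_id.
        for node in nodes:
            self._embeddings[node.node_id] = np.asarray(node.embedding)
            self._doc_ids[node.node_id] = node.ref_doc_id
        return list(self._embeddings)

    def delete(self, ref_doc_id):
        for node_id, doc_id in list(self._doc_ids.items()):
            if doc_id == ref_doc_id:
                del self._embeddings[node_id]
                del self._doc_ids[node_id]

    def query(self, query_embedding, top_k=2):
        q = np.asarray(query_embedding)
        scores = {
            node_id: float(emb @ q / (np.linalg.norm(emb) * np.linalg.norm(q)))
            for node_id, emb in self._embeddings.items()
        }
        # Return the top-k most similar node ids by cosine similarity.
        return sorted(scores, key=scores.get, reverse=True)[:top_k]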

Examples:

Ideas:

  • See a vector database out there that we don't support yet? Make a PR!

See reference for full details.


Retrievers

Our retriever classes are lightweight classes that implement a retrieve method. They may take in an index class as input - by default, each of our indices (list, vector, keyword) has an associated retriever. The output is a set of NodeWithScore objects (a Node object with an extra score field).

You may also choose to implement your own retriever classes on top of your own data if you wish.

Interface:

  • retrieve takes in a str or QueryBundle as input, and outputs a list of NodeWithScore objects
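
Here is a sketch of a custom retriever that composes two other retrievers (the class name and merge policy are our own invention, echoing the composed-retriever idea listed below):

from typing import List

from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore

class CombinedRetriever(BaseRetriever):
    """Hypothetical retriever that merges results from two sub-retrievers."""

    def __init__(self, first: BaseRetriever, second: BaseRetriever):
        self._first = first
        self._second = second
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        results = self._first.retrieve(query_bundle) + self._second.retrieve(query_bundle)
        # De-duplicate by node id, keeping the highest score.
        best = {}
        for item in results:
            node_id = item.node.node_id
            if node_id not in best or (item.score or 0.0) > (best[node_id].score or 0.0):
                best[node_id] = item
        return list(best.values())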

Examples:

Ideas:

  • Besides the "default" retrievers built on top of each index, what about fancier retrievers? E.g. retrievers that take in other retrievers as input? Or other types of data?

Query Engines

Our query engine classes are lightweight classes that implement a query method; query returns a response type. A query engine may take in a retriever class as input; for instance, our RetrieverQueryEngine takes in both a retriever and a BaseSynthesizer class for response synthesis, and its query method performs retrieval and synthesis before returning the final result. Query engines may also take in other query engine classes as input.

Interface:

  • query takes in a str or QueryBundle as input, and outputs a Response object.
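
The CustomQueryEngine base class makes this pattern concrete; the sketch below (our own class name) mirrors the RetrieverQueryEngine flow described above:

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.response_synthesizers import BaseSynthesizer
from llama_index.core.retrievers import BaseRetriever

class SimpleRAGQueryEngine(CustomQueryEngine):
    """Hypothetical engine: retrieve, then synthesize a response."""

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)
        return self.response_synthesizer.synthesize(query_str, nodes)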

Examples:


Query Transforms

A query transform augments a raw query string with associated transformations to improve index querying. This can be interpreted as a pre-processing stage, before the core index query logic is executed.

Interface: run takes in a str or QueryBundle as input, and outputs a transformed QueryBundle.
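
For instance, the built-in HyDE transform can be invoked directly; the import path shown is for the current core package, so please verify it against your installed version:

from llama_index.core.indices.query.query_transform import HyDEQueryTransform

hyde = HyDEQueryTransform(include_original=True)
query_bundle = hyde.run("What did the author do growing up?")
# The returned QueryBundle carries a hypothetical answer as extra embedding text.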

Examples:

See guide for more information.


Token Usage Optimizers

A token usage optimizer refines the retrieved Nodes to reduce token usage during response synthesis.

Interface: optimize takes in the QueryBundle and a text chunk str, and outputs a refined text chunk str that yields a more optimized response.
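
As a toy illustration of the optimize contract (not a real LlamaIndex class), one could keep only sentences that share a word with the query:

class KeywordOptimizer:
    """Toy sketch: keep only sentences that mention a query term."""

    def optimize(self, query_bundle, text_chunk: str) -> str:
        query_words = set(query_bundle.query_str.lower().split())
        sentences = text_chunk.split(". ")
        kept = [s for s in sentences if query_words & set(s.lower().split())]
        # Fall back to the original chunk if nothing matched.
        return ". ".join(kept) if kept else text_chunk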

Examples:


Node Postprocessors

A node postprocessor refines a list of retrieved nodes given configuration and context.

Interface: postprocess_nodes takes a list of Nodes and extra metadata (e.g. similarity and query), and outputs a refined list of Nodes.
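
For example, the built-in SimilarityPostprocessor drops nodes below a score threshold; the cutoff value is arbitrary, and retrieved_nodes is assumed to come from a retriever:

from llama_index.core.postprocessor import SimilarityPostprocessor

postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)
filtered_nodes = postprocessor.postprocess_nodes(retrieved_nodes)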

Examples:


Output Parsers

An output parser enables us to extract structured output from the plain text output generated by the LLM.

Interface:

  • format: formats a query str with structured output formatting instructions, and outputs the formatted str
  • parse: takes a str (from LLM response) as input, and gives a parsed structured output (optionally also validated, error-corrected).
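
A toy sketch of the two methods (deliberately simple; real parsers, such as the Pydantic-based one, add validation and error correction):

import json

class ToyJSONOutputParser:
    """Toy sketch of the format/parse contract."""

    def format(self, query: str) -> str:
        # Append structured-output instructions to the prompt.
        return query + "\n\nRespond with a single JSON object and nothing else."

    def parse(self, output: str) -> dict:
        # Parse the LLM's raw text into structured data.
        return json.loads(output)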

Examples:

See guide for more information.


2. :package: Contribute a Pack, Reader, Tool, or Dataset (formerly from llama-hub)

Tools, Readers, and Packs have all been migrated from llama-hub to the main llama-index repo (i.e., this one). Datasets still reside in llama-hub, but will be migrated here in the near future.

Contributing a new Reader or Tool involves submitting a new package within the llama-index-integrations/readers and llama-index-integrations/tools folders, respectively.

The LlamaIndex command-line tool can be used to initialize new Packs and Integrations. (NOTE: llama-index-cli comes installed with llama-index.)

cd ./llama-index-packs
llamaindex-cli new-package --kind "packs" --name "my new pack"

cd ./llama-index-integrations/readers
llamaindex-cli new-package --kind "readers" --name "new reader"

Executing the first set of shell commands will create a new folder called llama-index-packs-my-new-pack within the llama-index-packs directory, while the second set will create a new package directory called llama-index-readers-new-reader within the llama-index-integrations/readers directory.

Please add a detailed README to your new package, as it will appear on both llamahub.ai and PyPI.org. In addition to preparing your source code and supplying a detailed README, we also ask that you fill in some metadata so that your package appears on llamahub.ai with the correct information. You do so by adding the required metadata under the [tool.llamahub] section in your new package's pyproject.toml.