Commit 263b7247 authored by Jerry Liu, committed via GitHub

take 2: Add documentation (#103)

version: 2
sphinx:
  configuration: docs/conf.py
formats: all
python:
  version: 3.8
  install:
    - requirements: docs/requirements.txt
    - method: pip
      path: .
# 🗂️ GPT Index
GPT Index is a project consisting of a set of *data structures* that are created using LLMs and can be traversed using LLMs in order to answer queries.
PyPI: https://pypi.org/project/gpt-index/.
Documentation: https://gpt-index.readthedocs.io/en/latest/.
## 🚀 Overview
#### Context
- LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
- A big limitation of LLMs is context size (e.g. OpenAI's `davinci` model for GPT-3 has a [limit](https://openai.com/api/pricing/) of 4096 tokens. Large, but not infinite).
- The ability to feed "knowledge" to LLMs is restricted to this limited prompt size and model weights.
- **Thought**: What if LLMs could access a potentially much larger database of knowledge without retraining/finetuning?
#### Proposed Solution
That's where the **GPT Index** comes in. GPT Index is a simple, flexible interface between your external data and LLMs. It resolves the following pain points:
- Provides simple data structures to resolve prompt size limitations.
- Offers data connectors to your external data sources.
- Offers a comprehensive toolset for trading off cost and performance.
At the core of GPT Index is a **data structure**. Instead of relying on world knowledge encoded in the model weights, a GPT Index data structure does the following:
- Uses a pre-trained LLM primarily for *reasoning*/*summarization* instead of prior knowledge.
- Takes as input a large corpus of text data and builds a structured index over it (using an LLM or heuristics).
- Allows users to *query* the index in order to synthesize an answer to the question - this requires both *traversal* of the index as well as synthesis of the answer.
## 📄 Documentation
Full documentation can be found here: https://gpt-index.readthedocs.io/en/latest/.
Please check it out for the most up-to-date tutorials, how-to guides, references, and other resources!
The high-level design goal of this project is to test the capability of GPT-3 as a general-purpose processor to organize and retrieve data. From our current understanding, related works have used GPT-3 to reason over external database sources (see below); this work links reasoning with knowledge building.
## 💻 Example Usage
The main third-party package requirements are `transformers`, `openai`, and `langchain`.
All requirements should be contained within the `setup.py` file. To run the package locally without building the wheel, simply do `pip install -r requirements.txt`.
## Index Data Structures
- [`Tree Index`](gpt_index/indices/tree/README.md): Tree data structures
- **Creation**: with GPT hierarchical summarization over sub-documents
- **Query**: with GPT recursive querying over multiple choice problems
- [`Keyword Table Index`](gpt_index/indices/keyword_table/README.md): a keyword-based table
- **Creation**: with GPT keyword extraction over each sub-document
- **Query**: with GPT keyword extraction over question, match to sub-documents. *Create and refine* an answer over candidate sub-documents.
- [`List Index`](gpt_index/indices/list/README.md): a simple list-based data structure
- **Creation**: by splitting documents into a list of text chunks
- **Query**: use GPT with a *create and refine* prompt iteratively over the list of sub-documents
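As a rough sketch of how these three index types are constructed (assuming each class accepts a list of documents, as `GPTTreeIndex` does in the starter tutorial):
```python
from gpt_index import (
    GPTKeywordTableIndex,
    GPTListIndex,
    GPTTreeIndex,
    SimpleDirectoryReader,
)

# load sub-documents from a local folder
documents = SimpleDirectoryReader('data').load_data()

# tree index: GPT hierarchical summarization (LLM calls at build time)
tree_index = GPTTreeIndex(documents)
# keyword table index: GPT keyword extraction over each sub-document
table_index = GPTKeywordTableIndex(documents)
# list index: splits documents into text chunks (no LLM calls at build time)
list_index = GPTListIndex(documents)
```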
## Data Connectors
We currently offer connectors into the following data sources. External data sources are retrieved through their APIs, using a corresponding authentication token.
- Notion (`NotionPageReader`)
- Google Drive (`GoogleDocsReader`)
- Slack (`SlackReader`)
- MongoDB (local) (`SimpleMongoReader`)
- Wikipedia (`WikipediaReader`)
- local file directory (`SimpleDirectoryReader`)
Example notebooks of how to use data connectors are found in the [Data Connector Example Notebooks](examples/data_connectors).
## 🔬 Related Work [WIP]
[Measuring and Narrowing the Compositionality Gap in Language Models, by Press et al.](https://arxiv.org/abs/2210.03350)
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary image files added under `docs/_static/composability/`: `diagram.png` (30.3 KiB), `diagram_b0.png` (13.8 KiB), `diagram_b1.png` (29.8 KiB), `diagram_q1.png` (39.8 KiB), `diagram_q2.png` (54.1 KiB).
"""Configuration for sphinx."""
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
import sphinx_rtd_theme # noqa: F401
sys.path.insert(0, os.path.abspath("../"))
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "GPT Index"
copyright = "2022, Jerry Liu"
author = "Jerry Liu"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.coverage",
    "sphinx.ext.autodoc.typehints",
    "sphinx.ext.autosummary",
    "sphinx.ext.napoleon",
    "sphinx_rtd_theme",
    "sphinx.ext.mathjax",
    "myst_parser",
]
templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
# Installation and Setup
### Installation from Pip
You can simply do:
```bash
pip install gpt-index
```
### Installation from Source
Git clone this repository: `git clone git@github.com:jerryjliu/gpt_index.git`. Then do:
- `pip install -e .` if you want to do an editable install (you can modify source files) of just the package itself.
- `pip install -r requirements.txt` if you want to install optional dependencies + dependencies used for development (e.g. unit testing).
### Environment Setup
By default, we use the OpenAI GPT-3 `text-davinci-002` model. In order to use this, you must have an `OPENAI_API_KEY` set up.
You can register an API key by logging into [OpenAI's page and creating a new API token](https://beta.openai.com/account/api-keys).
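For example, one way to make the key visible to GPT Index is to set the environment variable from within Python before building an index (a minimal sketch; the placeholder is not a real token, and setting the variable in your shell works just as well):
```python
import os

# make the OpenAI key available to the underlying LLM client
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
```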
Overview
=====================================
This section shows you how to quickly get up and running with GPT Index.
# Starter Tutorial
Here is a starter example for using GPT Index. Make sure you've followed the [installation](installation.md) steps first.
### Download
GPT Index examples can be found in the `examples` folder of the GPT Index repository.
We first want to download this `examples` folder. An easy way to do this is to just clone the repo:
```bash
$ git clone git@github.com:jerryjliu/gpt_index.git
```
Next, navigate to your newly-cloned repository, and verify the contents:
```bash
$ cd gpt_index
$ ls
LICENSE      data_requirements.txt  tests/
MANIFEST.in  examples/              pyproject.toml
Makefile     experimental/          requirements.txt
README.md    gpt_index/             setup.py
```
We now want to navigate to the following folder:
```bash
$ cd examples/paul_graham_essay
```
This contains GPT Index examples around Paul Graham's essay, ["What I Worked On"](http://paulgraham.com/worked.html). A comprehensive set of examples is already provided in `TestEssay.ipynb`. For the purposes of this tutorial, we can focus on a simple example of getting GPT Index up and running.
### Build and Query Index
Create a new `.py` file with the following:
```python
from gpt_index import GPTTreeIndex, SimpleDirectoryReader
from IPython.display import Markdown, display
documents = SimpleDirectoryReader('data').load_data()
index = GPTTreeIndex(documents)
```
This builds an index over the documents in the `data` folder (which in this case just consists of the essay text). We then run the following:
```python
response = index.query("What did the author do growing up?")
print(response)
```
You should get back a response similar to the following: `The author wrote short stories and tried to program on an IBM 1401.`
### Saving and Loading
To save to disk and load from disk, do
```python
# save to disk
index.save_to_disk('index.json')
# load from disk
index = GPTTreeIndex.load_from_disk('index.json')
```
### Next Steps
That's it! For more information on GPT Index features, please check out the numerous "How-To Guides" to the left.
Additionally, if you would like to play around with Example Notebooks, check out [this link](/reference/example_notebooks.rst).
# Composability
GPT Index offers **composability** of your indices, meaning that you can build indices on top of other indices. This allows you to more effectively index your entire document tree in order to feed custom knowledge to GPT.
Composability allows you to define lower-level indices for each document, and higher-order indices over a collection of documents. For instance, you might define 1) a tree index for the text within each document, and 2) a list index over each tree index (one document) within your collection.
To see how this works, imagine you have 3 documents: `doc1`, `doc2`, and `doc3`.
```python
doc1 = SimpleDirectoryReader('data1').load_data()
doc2 = SimpleDirectoryReader('data2').load_data()
doc3 = SimpleDirectoryReader('data3').load_data()
```
![](/_static/composability/diagram_b0.png)
Now let's define a tree index for each document. In Python, we have:
```python
index1 = GPTTreeIndex(doc1)
index2 = GPTTreeIndex(doc2)
index3 = GPTTreeIndex(doc3)
```
![](/_static/composability/diagram_b1.png)
We can then create a list index on these 3 tree indices:
```python
list_index = GPTListIndex([index1, index2, index3])
```
![](/_static/composability/diagram.png)
During a query, we would start with the top-level list index. Each node in the list corresponds to an underlying tree index.
```python
response = list_index.query("Where did the author grow up?")
```
![](/_static/composability/diagram_q1.png)
So within a node, instead of fetching the text, we would recursively query the stored tree index to retrieve our answer.
![](/_static/composability/diagram_q2.png)
NOTE: You can stack indices as many times as you want, depending on the hierarchies of your knowledge base!
We can take a look at a code example below as well. We first build two tree indices, one over the Wikipedia NYC page, and the other over Paul Graham's essay. We then define a keyword table index over the two tree indices.
[Here is an example notebook](https://github.com/jerryjliu/gpt_index/blob/main/examples/composable_indices/ComposableIndices.ipynb).
# Cost Analysis
Each call to an LLM will cost some amount of money - for instance, OpenAI's Davinci costs $0.02 / 1k tokens. The cost of building an index and querying depends on:
1. the type of LLM used
2. the type of data structure used
3. parameters used during building
4. parameters used during querying
The cost of building and querying each index is a TODO in the reference documentation. In the meantime, here is a high-level overview of the cost structure of the indices.
### Index Building
#### Indices with no LLM calls
The following indices don't require LLM calls at all during building (0 cost):
- `GPTListIndex`
- `GPTSimpleKeywordTableIndex` - uses a regex keyword extractor to extract keywords from each document
- `GPTRAKEKeywordTableIndex` - uses a RAKE keyword extractor to extract keywords from each document
#### Indices with LLM calls
The following indices do require LLM calls during build time:
- `GPTTreeIndex` - use LLM to hierarchically summarize the text to build the tree
- `GPTKeywordTableIndex` - use LLM to extract keywords from each document
### Query Time
There will always be >= 1 LLM call during query time, in order to synthesize the final answer.
Some indices involve cost tradeoffs between building and querying. `GPTListIndex`, for instance, is free to build, but running a query over a list index (without filtering or embedding lookups) will call the LLM {math}`N` times.
Here are some notes regarding each of the indices:
- `GPTListIndex`: by default requires {math}`N` LLM calls, where {math}`N` is the number of nodes.
    - However, you can do `index.query(..., keyword="<keyword>")` to filter out nodes that don't contain the keyword.
- `GPTTreeIndex`: by default requires {math}`\log (N)` LLM calls, where {math}`N` is the number of leaf nodes.
    - Setting `child_branch_factor=2` will be more expensive than the default `child_branch_factor=1` (polynomial vs logarithmic), because we traverse 2 children instead of just 1 for each parent node.
- `GPTKeywordTableIndex`: by default requires an LLM call to extract query keywords.
    - You can do `index.query(..., mode="simple")` or `index.query(..., mode="rake")` to use regex/RAKE keyword extractors on your query text instead.
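To make these query-time tradeoffs concrete, here is a minimal sketch using the keyword filter and keyword-extractor modes mentioned above (assuming `list_index` and `table_index` have already been built; the query strings and keyword are illustrative):
```python
# list index: filter out nodes that don't contain the keyword,
# reducing the number of LLM calls below N
response = list_index.query(
    "What did the author do growing up?", keyword="growing"
)

# keyword table index: use a regex keyword extractor on the query text,
# avoiding the LLM call for query keyword extraction
response = table_index.query(
    "What did the author do growing up?", mode="simple"
)
```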
# Defining LLMs
The goal of GPT Index is to provide a toolkit of data structures that can organize external information in a manner that
is easily compatible with the prompt limitations of an LLM. Therefore LLMs are always used to construct the final
answer.
Depending on the [type of index](/reference/indices.rst) being used,
LLMs may also be used during index construction, insertion, and query traversal.
GPT Index uses Langchain's [LLM](https://langchain.readthedocs.io/en/latest/modules/llms.html)
and [LLMChain](https://langchain.readthedocs.io/en/latest/modules/chains.html) module to define
the underlying abstraction. We introduce a wrapper class,
[`LLMPredictor`](/reference/llm_predictor.rst), for integration into GPT Index.
By default, we use OpenAI's `text-davinci-002` model. But you may choose to customize
the underlying LLM being used.
## Example
An example snippet of customizing the LLM being used is shown below.
In this example, we use `text-davinci-003` instead of `text-davinci-002`. Note that
you may plug in any LLM shown on Langchain's
[LLM](https://langchain.readthedocs.io/en/latest/modules/llms.html) page.
```python
from gpt_index import GPTKeywordTableIndex, SimpleDirectoryReader, LLMPredictor
from langchain import OpenAI
# define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))
# load index from disk
index = GPTKeywordTableIndex.load_from_disk('index_table.json', llm_predictor=llm_predictor)
# get response from query
response = index.query("What did the author do after his time at Y Combinator?")
```
In this snippet, the index has already been created and saved to disk. We load
the existing index, and swap in a new `LLMPredictor` that is used during query time.
# Defining Prompts
Prompting is the fundamental input that gives LLMs their expressive power. GPT Index uses prompts to build the index, do insertion,
perform traversal during querying, and to synthesize the final answer.
GPT Index uses a finite set of *prompt types*, described [here](/reference/prompts.rst).
All index classes, along with their associated queries, utilize a subset of these prompts. The user may provide their own prompt.
If the user does not provide their own prompt, default prompts are used.
An API reference of all index classes and query classes is found below. The definition of each index class and query
contains optional prompts that the user may pass in.
- [Indices](/reference/indices.rst)
- [Queries](/reference/query.rst)
### Example
An example can be found in [this notebook](https://github.com/jerryjliu/gpt_index/blob/main/examples/paul_graham_essay/TestEssay.ipynb).
The corresponding snippet is below. We show how to define a custom Summarization Prompt that not only
contains a `text` field, but also a `query_str` field, during construction of the `GPTTreeIndex`, so that
the answer to the query can be simply synthesized from the root nodes.
```python
from gpt_index import Prompt, GPTTreeIndex, SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader('data').load_data()
# define custom prompt
query_str = "What did the author do growing up?"
summary_prompt_tmpl = (
    "Context information is below. \n"
    "---------------------\n"
    "{text}"
    "\n---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)
summary_prompt = Prompt(
    input_variables=["query_str", "text"],
    template=summary_prompt_tmpl,
)
# Build GPTTreeIndex: pass in custom prompt, also pass in query_str
index_with_query = GPTTreeIndex(documents, summary_template=summary_prompt, query_str=query_str)
```
Once the index is built, we can retrieve our answer:
```python
# directly retrieve response from root nodes instead of traversing tree
response = index_with_query.query(query_str, mode="retrieve")
```
# Data Connectors
We currently offer connectors into the following data sources. External data sources are retrieved through their APIs, using a corresponding authentication token.
The API reference documentation can be found [here](/reference/readers.rst).
- [Notion](https://developers.notion.com/) (`NotionPageReader`)
- [Google Docs](https://developers.google.com/docs/api) (`GoogleDocsReader`)
- [Slack](https://api.slack.com/) (`SlackReader`)
- MongoDB (`SimpleMongoReader`)
- Wikipedia (`WikipediaReader`)
- local file directory (`SimpleDirectoryReader`)
We offer [example notebooks of connecting to different data sources](https://github.com/jerryjliu/gpt_index/tree/main/examples/data_connectors). Please check them out!
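All readers follow the same load-then-index pattern. Here is a minimal sketch using the local file directory connector (the API-backed readers differ mainly in their constructor arguments and credentials; see the notebooks above for authoritative usage):
```python
from gpt_index import GPTListIndex, SimpleDirectoryReader

# read all files in a local folder into document objects
documents = SimpleDirectoryReader('data').load_data()
# any index can then be built over the loaded documents
index = GPTListIndex(documents)
```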
# Embedding support
GPT Index provides embedding support to our tree and list indices. In addition to each node storing text, each node can optionally store an embedding.
During query-time, we can use embeddings to do max-similarity retrieval of nodes before calling the LLM to synthesize an answer.
Since similarity lookup using embeddings (e.g. using cosine similarity) does not require an LLM call, embeddings serve as a cheaper lookup mechanism than using LLMs to traverse nodes.
NOTE: we currently support OpenAI embeddings. External embeddings are coming soon!
**How are Embeddings Generated?**
Embeddings are lazily generated and then cached at query time (if `mode="embedding"` is specified during `index.query`), and not during index construction.
This design choice prevents the need to generate embeddings for all text chunks during index construction.
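For instance, a minimal sketch of an embedding-based query (assuming an index built as in the starter tutorial; per the note above, node embeddings are computed lazily on the first such query):
```python
# rank nodes by embedding similarity instead of LLM-based traversal
response = index.query("What did the author do growing up?", mode="embedding")
```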
**Embedding Lookups**
For the list index:
- We iterate through every node in the list, and identify the top k nodes through embedding similarity. We use these nodes to synthesize an answer.
- See the [List Query API](/reference/indices/list_query.rst) for more details.
For the tree index:
- We start with the root nodes, and traverse down the tree by picking the child node through embedding similarity.
- See the [Tree Query API](/reference/indices/tree_query.rst) for more details.
**Example Notebook**
An example notebook is given [here](https://github.com/jerryjliu/gpt_index/blob/main/examples/test_wiki/TestNYC_Embeddings.ipynb).
# Insert Capabilities
Every GPT Index data structure allows insertion.
An example notebook showcasing our insert capabilities is given [here](https://github.com/jerryjliu/gpt_index/blob/main/examples/paul_graham_essay/InsertDemo.ipynb).
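As a minimal sketch of the insert flow (this assumes `insert` accepts a single document object; see the notebook above for authoritative usage):
```python
from gpt_index import GPTListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()

# build an index over the first document, then insert the rest incrementally
index = GPTListIndex(documents[:1])
for doc in documents[1:]:
    index.insert(doc)
```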
Overview
=====================================
The how-to section contains guides on some of the core features of GPT Index: