Commit 263b7247 authored by Jerry Liu, committed via GitHub

take 2: Add documentation (#103)

version: 2
sphinx:
  configuration: docs/conf.py
formats: all
python:
  version: 3.8
  install:
    - requirements: docs/requirements.txt
    - method: pip
      path: .
# 🗂️ GPT Index
GPT Index is a project consisting of a set of *data structures* that are created using LLMs and can be traversed using LLMs in order to answer queries.
PyPI: https://pypi.org/project/gpt-index/.
Documentation: https://gpt-index.readthedocs.io/en/latest/.
## 🚀 Overview
#### Context
- LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
- A big limitation of LLMs is context size (e.g. OpenAI's `davinci` model for GPT-3 has a [limit](https://openai.com/api/pricing/) of 4096 tokens. Large, but not infinite).
- The ability to feed "knowledge" to LLMs is restricted to this limited prompt size and model weights.
- **Thought**: What if LLMs could access a potentially much larger database of knowledge without retraining/finetuning?
#### Proposed Solution
That's where the **GPT Index** comes in. GPT Index is a simple, flexible interface between your external data and LLMs. It resolves the following pain points:
- Provides simple data structures to resolve prompt size limitations.
- Offers data connectors to your external data sources.
- Offers a comprehensive toolset for trading off cost and performance.
At the core of GPT Index is a **data structure**. Instead of relying on world knowledge encoded in the model weights, a GPT Index data structure does the following:
- Uses a pre-trained LLM primarily for *reasoning*/*summarization* instead of prior knowledge.
- Takes as input a large corpus of text data and builds a structured index over it (using an LLM or heuristics).
- Allows users to *query* the index in order to synthesize an answer to the question - this requires both *traversal* of the index as well as synthesis of the answer.
## 📄 Documentation
Full documentation can be found here: https://gpt-index.readthedocs.io/en/latest/.
Please check it out for the most up-to-date tutorials, how-to guides, references, and other resources!
The high-level design goal of this project is to test the capability of GPT-3 as a general-purpose processor to organize and retrieve data. From our current understanding, related works have used GPT-3 to reason over external database sources (see below); this work links reasoning with knowledge building.
## 💻 Example Usage
The main third-party package requirements are `transformers`, `openai`, and `langchain`.
All requirements should be contained within the `setup.py` file. To run the package locally without building the wheel, simply do `pip install -r requirements.txt`.
## Index Data Structures
- [`Tree Index`](gpt_index/indices/tree/README.md): Tree data structures
- **Creation**: with GPT hierarchical summarization over sub-documents
- **Query**: with GPT recursive querying over multiple choice problems
- [`Keyword Table Index`](gpt_index/indices/keyword_table/README.md): a keyword-based table
- **Creation**: with GPT keyword extraction over each sub-document
- **Query**: with GPT keyword extraction over question, match to sub-documents. *Create and refine* an answer over candidate sub-documents.
- [`List Index`](gpt_index/indices/list/README.md): a simple list-based data structure
- **Creation**: by splitting documents into a list of text chunks
- **Query**: use GPT with a *create and refine* prompt iteratively over the list of sub-documents
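As a rough sketch of how these three index types are constructed (assuming each class accepts a list of documents, as `GPTTreeIndex` does in the starter tutorial):
```python
from gpt_index import (
    GPTKeywordTableIndex,
    GPTListIndex,
    GPTTreeIndex,
    SimpleDirectoryReader,
)

# load sub-documents from a local folder
documents = SimpleDirectoryReader('data').load_data()

# tree index: GPT hierarchical summarization (LLM calls at build time)
tree_index = GPTTreeIndex(documents)
# keyword table index: GPT keyword extraction over each sub-document
table_index = GPTKeywordTableIndex(documents)
# list index: splits documents into text chunks (no LLM calls at build time)
list_index = GPTListIndex(documents)
```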
## Data Connectors
We currently offer connectors into the following data sources. External data sources are retrieved through their APIs, using a corresponding authentication token.
- Notion (`NotionPageReader`)
- Google Drive (`GoogleDocsReader`)
- Slack (`SlackReader`)
- MongoDB (local) (`SimpleMongoReader`)
- Wikipedia (`WikipediaReader`)
- local file directory (`SimpleDirectoryReader`)
Example notebooks of how to use data connectors are found in the [Data Connector Example Notebooks](examples/data_connectors).
## 🔬 Related Work [WIP]
[Measuring and Narrowing the Compositionality Gap in Language Models, by Press et al.](https://arxiv.org/abs/2210.03350)
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary image files added under `docs/_static/composability/`: `diagram.png` (30.3 KiB), `diagram_b0.png` (13.8 KiB), `diagram_b1.png` (29.8 KiB), `diagram_q1.png` (39.8 KiB), `diagram_q2.png` (54.1 KiB).
"""Configuration for sphinx."""
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
import sphinx_rtd_theme # noqa: F401
sys.path.insert(0, os.path.abspath("../"))
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "GPT Index"
copyright = "2022, Jerry Liu"
author = "Jerry Liu"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.coverage",
    "sphinx.ext.autodoc.typehints",
    "sphinx.ext.autosummary",
    "sphinx.ext.napoleon",
    "sphinx_rtd_theme",
    "sphinx.ext.mathjax",
    "myst_parser",
]
templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
# Installation and Setup
### Installation from Pip
You can simply do:
```bash
pip install gpt-index
```
### Installation from Source
Git clone this repository: `git clone git@github.com:jerryjliu/gpt_index.git`. Then do:
- `pip install -e .` if you want to do an editable install (you can modify source files) of just the package itself.
- `pip install -r requirements.txt` if you want to install optional dependencies + dependencies used for development (e.g. unit testing).
### Environment Setup
By default, we use the OpenAI GPT-3 `text-davinci-002` model. In order to use this, you must have an `OPENAI_API_KEY` set up.
You can register an API key by logging into [OpenAI's page and creating a new API token](https://beta.openai.com/account/api-keys).
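For example, one way to make the key visible to GPT Index is to set the environment variable from within Python before building an index (a minimal sketch; the placeholder is not a real token, and setting the variable in your shell works just as well):
```python
import os

# make the OpenAI key available to the underlying LLM client
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
```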
Overview
=====================================
This section shows you how to quickly get up and running with GPT Index.
# Starter Tutorial
Here is a starter example for using GPT Index. Make sure you've followed the [installation](installation.md) steps first.
### Download
GPT Index examples can be found in the `examples` folder of the GPT Index repository.
We first want to download this `examples` folder. An easy way to do this is to just clone the repo:
```bash
$ git clone git@github.com:jerryjliu/gpt_index.git
```
Next, navigate to your newly-cloned repository, and verify the contents:
```bash
$ cd gpt_index
$ ls
LICENSE      data_requirements.txt  tests/
MANIFEST.in  examples/              pyproject.toml
Makefile     experimental/          requirements.txt
README.md    gpt_index/             setup.py
```
We now want to navigate to the following folder:
```bash
$ cd examples/paul_graham_essay
```
This contains GPT Index examples around Paul Graham's essay, ["What I Worked On"](http://paulgraham.com/worked.html). A comprehensive set of examples is already provided in `TestEssay.ipynb`. For the purposes of this tutorial, we can focus on a simple example of getting GPT Index up and running.
### Build and Query Index
Create a new `.py` file with the following:
```python
from gpt_index import GPTTreeIndex, SimpleDirectoryReader
from IPython.display import Markdown, display
documents = SimpleDirectoryReader('data').load_data()
index = GPTTreeIndex(documents)
```
This builds an index over the documents in the `data` folder (which in this case just consists of the essay text). We then run the following:
```python
response = index.query("What did the author do growing up?")
print(response)
```
You should get back a response similar to the following: `The author wrote short stories and tried to program on an IBM 1401.`
### Saving and Loading
To save to disk and load from disk, do
```python
# save to disk
index.save_to_disk('index.json')
# load from disk
index = GPTTreeIndex.load_from_disk('index.json')
```
### Next Steps
That's it! For more information on GPT Index features, please check out the numerous "How-To Guides" to the left.
Additionally, if you would like to play around with Example Notebooks, check out [this link](/reference/example_notebooks.rst).
# Composability
GPT Index offers **composability** of your indices, meaning that you can build indices on top of other indices. This allows you to more effectively index your entire document tree in order to feed custom knowledge to GPT.
Composability allows you to define lower-level indices for each document, and higher-order indices over a collection of documents. For instance, you might define 1) a tree index for the text within each document, and 2) a list index over each tree index (one document) within your collection.
To see how this works, imagine you have 3 documents: `doc1`, `doc2`, and `doc3`.
```python
doc1 = SimpleDirectoryReader('data1').load_data()
doc2 = SimpleDirectoryReader('data2').load_data()
doc3 = SimpleDirectoryReader('data3').load_data()
```
![](/_static/composability/diagram_b0.png)
Now let's define a tree index for each document. In Python, we have:
```python
index1 = GPTTreeIndex(doc1)
index2 = GPTTreeIndex(doc2)
index3 = GPTTreeIndex(doc3)
```
![](/_static/composability/diagram_b1.png)
We can then create a list index on these 3 tree indices:
```python
list_index = GPTListIndex([index1, index2, index3])
```
![](/_static/composability/diagram.png)
During a query, we would start with the top-level list index. Each node in the list corresponds to an underlying tree index.
```python
response = list_index.query("Where did the author grow up?")
```
![](/_static/composability/diagram_q1.png)
So within a node, instead of fetching the text, we would recursively query the stored tree index to retrieve our answer.
![](/_static/composability/diagram_q2.png)
NOTE: You can stack indices as many times as you want, depending on the hierarchies of your knowledge base!
We can take a look at a code example below as well. We first build two tree indices, one over the Wikipedia NYC page, and the other over Paul Graham's essay. We then define a keyword table index over the two tree indices.
[Here is an example notebook](https://github.com/jerryjliu/gpt_index/blob/main/examples/composable_indices/ComposableIndices.ipynb).
# Cost Analysis
Each call to an LLM will cost some amount of money - for instance, OpenAI's Davinci costs $0.02 / 1k tokens. The cost of building an index and querying depends on:
1. the type of LLM used
2. the type of data structure used
3. parameters used during building
4. parameters used during querying
The cost of building and querying each index is a TODO in the reference documentation. In the meantime, here is a high-level overview of the cost structure of the indices.
### Index Building
#### Indices with no LLM calls
The following indices don't require LLM calls at all during building (0 cost):
- `GPTListIndex`
- `GPTSimpleKeywordTableIndex` - uses a regex keyword extractor to extract keywords from each document
- `GPTRAKEKeywordTableIndex` - uses a RAKE keyword extractor to extract keywords from each document
#### Indices with LLM calls
The following indices do require LLM calls during build time:
- `GPTTreeIndex` - use LLM to hierarchically summarize the text to build the tree
- `GPTKeywordTableIndex` - use LLM to extract keywords from each document
### Query Time
There will always be >= 1 LLM call during query time, in order to synthesize the final answer.
Some indices involve cost tradeoffs between building and querying. `GPTListIndex`, for instance, is free to build, but running a query over a list index (without filtering or embedding lookups) will call the LLM {math}`N` times.
Here are some notes regarding each of the indices:
- `GPTListIndex`: by default requires {math}`N` LLM calls, where {math}`N` is the number of nodes.
    - However, you can do `index.query(..., keyword="<keyword>")` to filter out nodes that don't contain the keyword.
- `GPTTreeIndex`: by default requires {math}`\log (N)` LLM calls, where {math}`N` is the number of leaf nodes.
    - Setting `child_branch_factor=2` will be more expensive than the default `child_branch_factor=1` (polynomial vs logarithmic), because we traverse 2 children instead of just 1 for each parent node.
- `GPTKeywordTableIndex`: by default requires an LLM call to extract query keywords.
    - You can do `index.query(..., mode="simple")` or `index.query(..., mode="rake")` to use regex/RAKE keyword extractors on your query text instead.
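To make these query-time tradeoffs concrete, here is a minimal sketch using the keyword filter and keyword-extractor modes mentioned above (assuming `list_index` and `table_index` have already been built; the query strings and keyword are illustrative):
```python
# list index: filter out nodes that don't contain the keyword,
# reducing the number of LLM calls below N
response = list_index.query(
    "What did the author do growing up?", keyword="growing"
)

# keyword table index: use a regex keyword extractor on the query text,
# avoiding the LLM call for query keyword extraction
response = table_index.query(
    "What did the author do growing up?", mode="simple"
)
```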
# Defining LLMs
The goal of GPT Index is to provide a toolkit of data structures that can organize external information in a manner that
is easily compatible with the prompt limitations of an LLM. Therefore LLMs are always used to construct the final
answer.
Depending on the [type of index](/reference/indices.rst) being used,
LLMs may also be used during index construction, insertion, and query traversal.
GPT Index uses Langchain's [LLM](https://langchain.readthedocs.io/en/latest/modules/llms.html)
and [LLMChain](https://langchain.readthedocs.io/en/latest/modules/chains.html) module to define
the underlying abstraction. We introduce a wrapper class,
[`LLMPredictor`](/reference/llm_predictor.rst), for integration into GPT Index.
By default, we use OpenAI's `text-davinci-002` model. But you may choose to customize
the underlying LLM being used.
## Example
An example snippet of customizing the LLM being used is shown below.
In this example, we use `text-davinci-003` instead of `text-davinci-002`. Note that
you may plug in any LLM shown on Langchain's
[LLM](https://langchain.readthedocs.io/en/latest/modules/llms.html) page.
```python
from gpt_index import GPTKeywordTableIndex, SimpleDirectoryReader, LLMPredictor
from langchain import OpenAI
# define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))
# load index from disk
index = GPTKeywordTableIndex.load_from_disk('index_table.json', llm_predictor=llm_predictor)
# get response from query
response = index.query("What did the author do after his time at Y Combinator?")
```
In this snippet, the index has already been created and saved to disk. We load
the existing index, and swap in a new `LLMPredictor` that is used during query time.
# Defining Prompts
Prompting is the fundamental input that gives LLMs their expressive power. GPT Index uses prompts to build the index, do insertion,
perform traversal during querying, and to synthesize the final answer.
GPT Index uses a finite set of *prompt types*, described [here](/reference/prompts.rst).
All index classes, along with their associated queries, utilize a subset of these prompts. The user may provide their own prompt.
If the user does not provide their own prompt, default prompts are used.
An API reference of all index classes and query classes is found below. The definition of each index class and query
contains optional prompts that the user may pass in.
- [Indices](/reference/indices.rst)
- [Queries](/reference/query.rst)
### Example
An example can be found in [this notebook](https://github.com/jerryjliu/gpt_index/blob/main/examples/paul_graham_essay/TestEssay.ipynb).
The corresponding snippet is below. We show how to define a custom Summarization Prompt that not only
contains a `text` field, but also a `query_str` field, during construction of the `GPTTreeIndex`, so that
the answer to the query can be simply synthesized from the root nodes.
```python
from gpt_index import Prompt, GPTTreeIndex, SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader('data').load_data()
# define custom prompt
query_str = "What did the author do growing up?"
summary_prompt_tmpl = (
    "Context information is below. \n"
    "---------------------\n"
    "{text}"
    "\n---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)
summary_prompt = Prompt(
    input_variables=["query_str", "text"],
    template=summary_prompt_tmpl,
)
# Build GPTTreeIndex: pass in custom prompt, also pass in query_str
index_with_query = GPTTreeIndex(documents, summary_template=summary_prompt, query_str=query_str)
```
Once the index is built, we can retrieve our answer:
```python
# directly retrieve response from root nodes instead of traversing tree
response = index_with_query.query(query_str, mode="retrieve")
```
# Data Connectors
We currently offer connectors into the following data sources. External data sources are retrieved through their APIs, using a corresponding authentication token.
The API reference documentation can be found [here](/reference/readers.rst).
- [Notion](https://developers.notion.com/) (`NotionPageReader`)
- [Google Docs](https://developers.google.com/docs/api) (`GoogleDocsReader`)
- [Slack](https://api.slack.com/) (`SlackReader`)
- MongoDB (`SimpleMongoReader`)
- Wikipedia (`WikipediaReader`)
- local file directory (`SimpleDirectoryReader`)
We offer [example notebooks of connecting to different data sources](https://github.com/jerryjliu/gpt_index/tree/main/examples/data_connectors). Please check them out!
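All readers follow the same load-then-index pattern. Here is a minimal sketch using the local file directory connector (the API-backed readers differ mainly in their constructor arguments and credentials; see the notebooks above for authoritative usage):
```python
from gpt_index import GPTListIndex, SimpleDirectoryReader

# read all files in a local folder into document objects
documents = SimpleDirectoryReader('data').load_data()
# any index can then be built over the loaded documents
index = GPTListIndex(documents)
```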
# Embedding support
GPT Index provides embedding support to our tree and list indices. In addition to each node storing text, each node can optionally store an embedding.
During query-time, we can use embeddings to do max-similarity retrieval of nodes before calling the LLM to synthesize an answer.
Since similarity lookup using embeddings (e.g. using cosine similarity) does not require an LLM call, embeddings serve as a cheaper lookup mechanism than using LLMs to traverse nodes.
NOTE: we currently support OpenAI embeddings. External embeddings are coming soon!
**How are Embeddings Generated?**
Embeddings are lazily generated and then cached at query time (if `mode="embedding"` is specified during `index.query`), and not during index construction.
This design choice prevents the need to generate embeddings for all text chunks during index construction.
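For instance, a minimal sketch of an embedding-based query (assuming an index built as in the starter tutorial; per the note above, node embeddings are computed lazily on the first such query):
```python
# rank nodes by embedding similarity instead of LLM-based traversal
response = index.query("What did the author do growing up?", mode="embedding")
```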
**Embedding Lookups**
For the list index:
- We iterate through every node in the list, and identify the top k nodes through embedding similarity. We use these nodes to synthesize an answer.
- See the [List Query API](/reference/indices/list_query.rst) for more details.
For the tree index:
- We start with the root nodes, and traverse down the tree by picking the child node through embedding similarity.
- See the [Tree Query API](/reference/indices/tree_query.rst) for more details.
**Example Notebook**
An example notebook is given [here](https://github.com/jerryjliu/gpt_index/blob/main/examples/test_wiki/TestNYC_Embeddings.ipynb).
# Insert Capabilities
Every GPT Index data structure allows insertion.
An example notebook showcasing our insert capabilities is given [here](https://github.com/jerryjliu/gpt_index/blob/main/examples/paul_graham_essay/InsertDemo.ipynb).
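As a minimal sketch of the insert flow (this assumes `insert` accepts a single document object; see the notebook above for authoritative usage):
```python
from gpt_index import GPTListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()

# build an index over the first document, then insert the rest incrementally
index = GPTListIndex(documents[:1])
for doc in documents[1:]:
    index.insert(doc)
```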
Overview
=====================================
The how-to section contains guides on some of the core features of GPT Index: