From eb3d4af204d2f69511753c48fe0a4bfc3e85c98e Mon Sep 17 00:00:00 2001
From: Emanuel Ferreira <contatoferreirads@gmail.com>
Date: Sat, 27 Jan 2024 08:37:24 -0300
Subject: [PATCH] docs: usage metadata extraction (#460)

---
 apps/docs/docs/modules/data_index.md          |  2 +-
 apps/docs/docs/modules/data_loader.md         |  2 +-
 .../documents_and_nodes/_category_.yml        |  2 +
 .../index.md}                                 |  0
 .../metadata_extraction.md                    | 82 +++++++++++++++++++
 apps/docs/docs/modules/embedding.md           |  2 +-
 apps/docs/docs/modules/index.md               | 31 -------
 apps/docs/docs/modules/llm.md                 |  2 +-
 .../_category_.yml                            |  2 +-
 .../{vectorStores => vector_stores}/qdrant.md |  0
 10 files changed, 89 insertions(+), 36 deletions(-)
 create mode 100644 apps/docs/docs/modules/documents_and_nodes/_category_.yml
 rename apps/docs/docs/modules/{documents_and_nodes.md => documents_and_nodes/index.md} (100%)
 create mode 100644 apps/docs/docs/modules/documents_and_nodes/metadata_extraction.md
 delete mode 100644 apps/docs/docs/modules/index.md
 rename apps/docs/docs/modules/{vectorStores => vector_stores}/_category_.yml (65%)
 rename apps/docs/docs/modules/{vectorStores => vector_stores}/qdrant.md (100%)

diff --git a/apps/docs/docs/modules/data_index.md b/apps/docs/docs/modules/data_index.md
index a469817db..2855e4e40 100644
--- a/apps/docs/docs/modules/data_index.md
+++ b/apps/docs/docs/modules/data_index.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 2
+sidebar_position: 3
 ---
 
 # Index
diff --git a/apps/docs/docs/modules/data_loader.md b/apps/docs/docs/modules/data_loader.md
index 977f2f57d..f1b1aa97a 100644
--- a/apps/docs/docs/modules/data_loader.md
+++ b/apps/docs/docs/modules/data_loader.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 1
+sidebar_position: 2
 ---
 
 # Reader / Loader
diff --git a/apps/docs/docs/modules/documents_and_nodes/_category_.yml b/apps/docs/docs/modules/documents_and_nodes/_category_.yml
new file mode 100644
index 000000000..e305c4754
--- /dev/null
+++ b/apps/docs/docs/modules/documents_and_nodes/_category_.yml
@@ -0,0 +1,2 @@
+label: "Document / Nodes"
+position: 0
diff --git a/apps/docs/docs/modules/documents_and_nodes.md b/apps/docs/docs/modules/documents_and_nodes/index.md
similarity index 100%
rename from apps/docs/docs/modules/documents_and_nodes.md
rename to apps/docs/docs/modules/documents_and_nodes/index.md
diff --git a/apps/docs/docs/modules/documents_and_nodes/metadata_extraction.md b/apps/docs/docs/modules/documents_and_nodes/metadata_extraction.md
new file mode 100644
index 000000000..3782c2f6e
--- /dev/null
+++ b/apps/docs/docs/modules/documents_and_nodes/metadata_extraction.md
@@ -0,0 +1,82 @@
+# Metadata Extraction Usage Pattern
+
+You can use LLMs to automate metadata extraction with our `Metadata Extractor` modules.
+
+Our metadata extractor modules include the following "feature extractors":
+
+- `SummaryExtractor` - automatically extracts a summary over a set of Nodes
+- `QuestionsAnsweredExtractor` - extracts a set of questions that each Node can answer
+- `TitleExtractor` - extracts a title from the context of each Node, grouped by document, and combines them
+- `KeywordExtractor` - extracts keywords from the context of each Node
+
+You can then chain the `Metadata Extractors` with the `IngestionPipeline` to extract metadata from a set of documents:
+
+```ts
+import {
+  IngestionPipeline,
+  TitleExtractor,
+  QuestionsAnsweredExtractor,
+  Document,
+  OpenAI,
+} from "llamaindex";
+
+async function main() {
+  const openAILLM = new OpenAI({ model: "gpt-3.5-turbo" });
+
+  const pipeline = new IngestionPipeline({
+    transformations: [
+      new TitleExtractor(openAILLM),
+      new QuestionsAnsweredExtractor(openAILLM),
+    ],
+  });
+
+  const nodes = await pipeline.run({
+    documents: [
+      new Document({ text: "I am 10 years old. John is 20 years old." }),
+    ],
+  });
+
+  for (const node of nodes) {
+    console.log(node.metadata);
+  }
+}
+
+main().then(() => console.log("done"));
+```
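+
+You can mix and match the other extractors in the same way. The sketch below (assuming `SummaryExtractor` and `KeywordExtractor` are also exported from "llamaindex" and take the LLM as their first constructor argument, like `TitleExtractor` above) attaches summaries and keywords instead:
+
+```ts
+import {
+  IngestionPipeline,
+  SummaryExtractor,
+  KeywordExtractor,
+  Document,
+  OpenAI,
+} from "llamaindex";
+
+async function main() {
+  const openAILLM = new OpenAI({ model: "gpt-3.5-turbo" });
+
+  // Assumption: both extractors accept the LLM positionally, like TitleExtractor.
+  const pipeline = new IngestionPipeline({
+    transformations: [
+      new SummaryExtractor(openAILLM),
+      new KeywordExtractor(openAILLM),
+    ],
+  });
+
+  const nodes = await pipeline.run({
+    documents: [
+      new Document({ text: "I am 10 years old. John is 20 years old." }),
+    ],
+  });
+
+  for (const node of nodes) {
+    // Exact metadata keys depend on each extractor's implementation.
+    console.log(node.metadata);
+  }
+}
+
+main().then(() => console.log("done"));
+```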
diff --git a/apps/docs/docs/modules/embedding.md b/apps/docs/docs/modules/embedding.md
index 84224cffa..bf8e0bacc 100644
--- a/apps/docs/docs/modules/embedding.md
+++ b/apps/docs/docs/modules/embedding.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 1
+sidebar_position: 2
 ---
 
 # Embedding
diff --git a/apps/docs/docs/modules/index.md b/apps/docs/docs/modules/index.md
deleted file mode 100644
index a5f91feb5..000000000
--- a/apps/docs/docs/modules/index.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Core Modules
-
-LlamaIndex.TS offers several core modules, separated into high-level modules for quickly getting started, and low-level modules for customizing key components as you need.
-
-## High-Level Modules
-
-- [**Document**](./high_level/documents_and_nodes.md): A document represents a text file, PDF file or other contiguous piece of data.
-
-- [**Node**](./high_level/documents_and_nodes.md): The basic data building block. Most commonly, these are parts of the document split into manageable pieces that are small enough to be fed into an embedding model and LLM.
-
-- [**Reader/Loader**](./high_level/data_loader.md): A reader or loader is something that takes in a document in the real world and transforms it into a Document class that can then be used in your Index and queries. We currently support plain text files and PDFs, with many more to come.
-
-- [**Indexes**](./high_level/data_index.md): Indexes store the Nodes and the embeddings of those nodes.
-
-- [**QueryEngine**](./high_level/query_engine.md): Query engines take the query you put in and give you back the result. Query engines generally combine a pre-built prompt with selected nodes from your Index to give the LLM the context it needs to answer your query.
-
-- [**ChatEngine**](./high_level/chat_engine.md): A ChatEngine helps you build a chatbot that will interact with your Indexes.
-
-## Low-Level Modules
-
-- [**LLM**](./low_level/llm.md): The LLM class is a unified interface over a large language model provider such as OpenAI GPT-4, Anthropic Claude, or Meta LLaMA. You can subclass it to write a connector to your own large language model.
-
-- [**Embedding**](./low_level/embedding.md): An embedding is represented as a vector of floating point numbers. OpenAI's text-embedding-ada-002 is our default embedding model and each embedding it generates consists of 1,536 floating point numbers. Another popular embedding model is BERT, which uses 768 floating point numbers to represent each Node. We provide a number of utilities to work with embeddings, including 3 similarity calculation options and Maximum Marginal Relevance.
-
-- [**TextSplitter/NodeParser**](./low_level/node_parser.md): Text splitting strategies are incredibly important to the overall efficacy of the embedding search. Currently, while we do have a default, there's no one-size-fits-all solution. Depending on the source documents, you may want to use different splitting sizes and strategies. Currently we support splitting by fixed size, splitting by fixed size with overlapping sections, splitting by sentence, and splitting by paragraph. The text splitter is used by the NodeParser when splitting `Document`s into `Node`s.
-
-- [**Retriever**](./low_level/retriever.md): The Retriever is what actually chooses the Nodes to retrieve from the index. Here, you may wish to try retrieving more or fewer Nodes per query, changing your similarity function, or creating your own retriever for each individual use case in your application. For example, you may wish to have a separate retriever for code content vs. text content.
-
-- [**ResponseSynthesizer**](./low_level/response_synthesizer.md): The ResponseSynthesizer is responsible for taking a query string and using a list of `Node`s to generate a response. This can take many forms, like iterating over all the context and refining an answer, or building a tree of summaries and returning the root summary.
-
-- [**Storage**](./low_level/storage.md): At some point you're going to want to store your indexes, data and vectors instead of re-running the embedding models every time. IndexStore, DocStore, VectorStore, and KVStore are abstractions that let you do that. Combined, they form the StorageContext. Currently, we allow you to persist your embeddings in files on the filesystem (or a virtual in-memory file system), but we are also actively adding integrations to Vector Databases.
diff --git a/apps/docs/docs/modules/llm.md b/apps/docs/docs/modules/llm.md
index 6de5f4225..7f69f13db 100644
--- a/apps/docs/docs/modules/llm.md
+++ b/apps/docs/docs/modules/llm.md
@@ -1,5 +1,5 @@
 ---
-sidebar_position: 1
+sidebar_position: 2
 ---
 
 # LLM
diff --git a/apps/docs/docs/modules/vectorStores/_category_.yml b/apps/docs/docs/modules/vector_stores/_category_.yml
similarity index 65%
rename from apps/docs/docs/modules/vectorStores/_category_.yml
rename to apps/docs/docs/modules/vector_stores/_category_.yml
index ac980e1f1..72360abda 100644
--- a/apps/docs/docs/modules/vectorStores/_category_.yml
+++ b/apps/docs/docs/modules/vector_stores/_category_.yml
@@ -1,2 +1,2 @@
 label: "Vector Stores"
-position: 0
+position: 1
diff --git a/apps/docs/docs/modules/vectorStores/qdrant.md b/apps/docs/docs/modules/vector_stores/qdrant.md
similarity index 100%
rename from apps/docs/docs/modules/vectorStores/qdrant.md
rename to apps/docs/docs/modules/vector_stores/qdrant.md
-- 
GitLab