Adding comprehensive docs for SimpleDirectoryReader (#9105)

* All missing pages to sitemap
* WIP for SDR
* Docs for SimpleDirectoryReader

Unverified Commit cf053b98 authored 1 year ago by Laurie Voss Committed by GitHub 1 year ago
--- a/docs/.gitignore
+++ b/docs/.gitignore
+api
--- a/docs/module_guides/loading/loading.md
+++ b/docs/module_guides/loading/loading.md
@@ -2,9 +2,18 @@

 The key to data ingestion in LlamaIndex is the Loaders. Once you have data, you may further refine your Documents and Nodes.

+Once you have [learned about the basics of loading data](/understanding/loading/loading.html) in our Understanding section, you can read on to learn more about:
+
+- [SimpleDirectoryReader](simpledirectoryreader.md), our built-in loader for loading all sorts of file types from a local directory
+- [LlamaHub](connector/root.md), our registry of hundreds of data loading libraries to ingest data from any source
+- [Document and Node objects](documents_and_nodes/root.md) and how to customize them for more advanced use cases
+- [Node parsers](node_parsers/root.md), our set of helper classes to generate nodes from raw text and files
+- [The ingestion pipeline](ingestion_pipeline/root.md) which allows you to set up a repeatable, cache-optimized process for loading data.
+
 ```{toctree}
 ---
 maxdepth: 1
+hidden: true
 ---
 connector/root.md
 documents_and_nodes/root.md

--- a/docs/module_guides/loading/simpledirectoryreader.md
+++ b/docs/module_guides/loading/simpledirectoryreader.md
+# SimpleDirectoryReader
+
+`SimpleDirectoryReader` is the simplest way to load data from local files into LlamaIndex. For production use cases, it's more likely that you'll want to use one of the many Readers available on [LlamaHub](https://llamahub.ai), but `SimpleDirectoryReader` is a great way to get started.
+
+## Supported file types
+
+By default `SimpleDirectoryReader` will try to read any files it finds, treating them all as text. In addition to plain text, it explicitly supports the following file types, which are automatically detected based on file extension:
+
+- .csv - comma-separated values
+- .docx - Microsoft Word
+- .epub - EPUB ebook format
+- .hwp - Hangul Word Processor
+- .ipynb - Jupyter Notebook
+- .jpeg, .jpg - JPEG image
+- .mbox - MBOX email archive
+- .md - Markdown
+- .mp3, .mp4 - audio and video
+- .pdf - Portable Document Format
+- .png - Portable Network Graphics
+- .ppt, .pptm, .pptx - Microsoft PowerPoint
+
+One file type you may be expecting to find here is JSON; for that we recommend you use our [JSON Loader](https://llamahub.ai/l/file-json).
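
As a rough illustration of extension-based detection, the sketch below picks a loader by file suffix and falls back to plain text. It is not the library's implementation: the `LOADER_BY_EXTENSION` table, the loader names, the `pick_loader` helper, and the case-insensitive matching are all assumptions made for this example.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a descriptive loader name.
# The names are illustrative only and do not refer to real classes.
LOADER_BY_EXTENSION = {
    ".csv": "csv loader",
    ".docx": "word loader",
    ".md": "markdown loader",
    ".pdf": "pdf loader",
}


def pick_loader(file_path: str) -> str:
    """Return the loader name for a path, falling back to plain text."""
    # Lowercasing the suffix is a choice made for this sketch.
    suffix = Path(file_path).suffix.lower()
    return LOADER_BY_EXTENSION.get(suffix, "plain text")


print(pick_loader("reports/q3.PDF"))  # pdf loader
print(pick_loader("notes/todo.txt"))  # plain text
```

Any extension missing from the table, or a file with no extension at all, ends up on the plain-text path, which mirrors the "treat everything else as text" behavior described above.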
+
+## Usage
+
+The most basic usage is to pass an `input_dir`; the reader will then load all supported files in that directory:
+
+```python
+from llama_index import SimpleDirectoryReader
+
+reader = SimpleDirectoryReader(input_dir="path/to/directory")
+documents = reader.load_data()
+```
+
+### Reading from subdirectories
+
+By default, `SimpleDirectoryReader` will only read files in the top level of the directory. To read from subdirectories, set `recursive=True`:
+
+```python
+SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
+```
+
+### Restricting the files loaded
+
+Instead of loading every file in a directory, you can pass an explicit list of file paths as `input_files`:
+
+```python
+SimpleDirectoryReader(input_files=["path/to/file1", "path/to/file2"])
+```
+
+Alternatively, you can pass a list of file paths to **exclude** using `exclude`:
+
+```python
+SimpleDirectoryReader(
+    input_dir="path/to/directory", exclude=["path/to/file1", "path/to/file2"]
+)
+```
+
+You can also set `required_exts` to a list of file extensions to only load files with those extensions:
+
+```python
+SimpleDirectoryReader(
+    input_dir="path/to/directory", required_exts=[".pdf", ".docx"]
+)
+```
+
+And you can set a maximum number of files to be loaded with `num_files_limit`:
+
+```python
+SimpleDirectoryReader(input_dir="path/to/directory", num_files_limit=100)
+```
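
When the built-in filters are not quite enough, one option is to build the file list yourself and pass it as `input_files`. The sketch below uses only the standard library; `select_files` is a hypothetical helper, not part of LlamaIndex, and it only looks at the top level of the directory.

```python
import tempfile
from pathlib import Path


def select_files(directory, extensions, limit=None):
    """Return sorted top-level file paths whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    matches = sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )
    return matches if limit is None else matches[:limit]


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.docx", "c.txt"):
        Path(tmp, name).write_text("example")
    selected = [Path(p).name for p in select_files(tmp, [".pdf", ".docx"])]
    limited = [Path(p).name for p in select_files(tmp, [".pdf", ".docx"], limit=1)]

print(selected)  # ['a.pdf', 'b.docx']

# The full paths could then be handed to the reader, for example:
# SimpleDirectoryReader(input_files=select_files("path/to/directory", [".pdf"]))
```

Building the list up front also makes it easy to log or inspect exactly which files will be loaded before any parsing happens.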
+
+### Specifying file encoding
+
+`SimpleDirectoryReader` expects files to be `utf-8` encoded, but you can override this using the `encoding` parameter:
+
+```python
+SimpleDirectoryReader(input_dir="path/to/directory", encoding="latin-1")
+```
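
If you are unsure which encoding a directory of legacy files uses, a quick check before loading can help you choose the value to pass as `encoding`. The following is a minimal sketch using only the standard library; `guess_encoding` and its candidate list are assumptions for illustration, not a LlamaIndex feature.

```python
import os
import tempfile


def guess_encoding(file_path, candidates=("utf-8", "latin-1")):
    """Return the first candidate that decodes the whole file, else None."""
    with open(file_path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# Demo: the byte 0xE9 is "é" in latin-1 but is not valid UTF-8 on its own.
with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    with open(legacy_path, "wb") as f:
        f.write(b"caf\xe9")
    modern_path = os.path.join(tmp, "modern.txt")
    with open(modern_path, "wb") as f:
        f.write("café".encode("utf-8"))
    detected_legacy = guess_encoding(legacy_path)
    detected_modern = guess_encoding(modern_path)

print(detected_legacy)  # latin-1
print(detected_modern)  # utf-8
```

Because `latin-1` can decode any byte sequence, it acts as a catch-all here; a successful decode does not prove the text is actually Latin-1, so it is worth spot-checking the loaded output.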
+
+### Extracting metadata
+
+You can specify a function that reads each file and extracts metadata, which is then attached to the resulting `Document` object for that file, by passing the function as `file_metadata`:
+
+```python
+def get_meta(file_path):
+    return {"foo": "bar", "file_path": file_path}
+
+
+SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_meta)
+```
+
+The function should take a single argument, the file path, and return a dictionary of metadata.
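
A slightly more realistic metadata function might derive its values from the filesystem. The sketch below follows the contract described above (one file path in, one dictionary out); the specific keys `file_name`, `file_size`, and `last_modified` are hypothetical choices for this example.

```python
import os
import tempfile
from datetime import datetime, timezone


def get_meta(file_path):
    """Build a metadata dictionary from basic filesystem information."""
    stat = os.stat(file_path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(file_path),
        "file_size": stat.st_size,  # size in bytes
        "last_modified": modified.strftime("%Y-%m-%d"),
    }


# Demo on a temporary file so the sketch runs without any real data.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w") as f:
        f.write("hello")
    meta = get_meta(sample_path)

print(meta["file_name"], meta["file_size"])  # sample.txt 5

# The function could then be passed to the reader, for example:
# SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_meta)
```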
+
+### Extending to other file types
+
+You can extend `SimpleDirectoryReader` to read other file types by passing a dictionary that maps file extensions to instances of `BaseReader` as `file_extractor`. A `BaseReader` should read the file and return a list of Documents. For example, to add custom support for `.myfile` files:
+
+```python
+from llama_index import SimpleDirectoryReader
+from llama_index.readers.base import BaseReader
+from llama_index.schema import Document
+
+
+class MyFileReader(BaseReader):
+    def load_data(self, file, extra_info=None):
+        with open(file, "r") as f:
+            text = f.read()
+        # load_data returns a list of Document objects
+        return [Document(text=text + "Foobar", extra_info=extra_info or {})]
+
+
+reader = SimpleDirectoryReader(
+    input_dir="./data", file_extractor={".myfile": MyFileReader()}
+)
+
+documents = reader.load_data()
+print(documents)
+```
+
+Note that this mapping will override the default file extractors for the file types you specify, so you'll need to add them back in if you want to support them.