Commit cf053b98, authored by Laurie Voss, committed by GitHub

Adding comprehensive docs for SimpleDirectoryReader (#9105)

* All missing pages to sitemap

* WIP for SDR

* Docs for SimpleDirectoryReader
parent 541e3b1b
......@@ -2,9 +2,18 @@
The key to data ingestion in LlamaIndex is the Loaders. Once you have data, you may further refine your Documents and Nodes.
Once you have [learned about the basics of loading data](/understanding/loading/loading.html) in our Understanding section, you can read on to learn more about:
- [SimpleDirectoryReader](simpledirectoryreader.md), our built-in loader for loading all sorts of file types from a local directory
- [LlamaHub](connector/root.md), our registry of hundreds of data loading libraries to ingest data from any source
- [Document and Node objects](documents_and_nodes/root.md) and how to customize them for more advanced use cases
- [Node parsers](node_parsers/root.md), our set of helper classes to generate nodes from raw text and files
- [The ingestion pipeline](ingestion_pipeline/root.md), which allows you to set up a repeatable, cache-optimized process for loading data
```{toctree}
---
maxdepth: 1
hidden: true
---
connector/root.md
documents_and_nodes/root.md
......
# SimpleDirectoryReader
`SimpleDirectoryReader` is the simplest way to load data from local files into LlamaIndex. For production use cases, it's more likely that you'll want to use one of the many Readers available on [LlamaHub](https://llamahub.ai), but `SimpleDirectoryReader` is a great way to get started.
## Supported file types
By default `SimpleDirectoryReader` will try to read any files it finds, treating them all as text. In addition to plain text, it explicitly supports the following file types, which are automatically detected based on file extension:
- .csv - comma-separated values
- .docx - Microsoft Word
- .epub - EPUB ebook format
- .hwp - Hangul Word Processor
- .ipynb - Jupyter Notebook
- .jpeg, .jpg - JPEG image
- .mbox - MBOX email archive
- .md - Markdown
- .mp3, .mp4 - audio and video
- .pdf - Portable Document Format
- .png - Portable Network Graphics
- .ppt, .pptm, .pptx - Microsoft PowerPoint
One file type you may be expecting to find here is JSON; for that we recommend you use our [JSON Loader](https://llamahub.ai/l/file-json).
## Usage
The most basic usage is to pass an `input_dir` and it will load all supported files in that directory:
```python
from llama_index import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()
```
### Reading from subdirectories
By default, `SimpleDirectoryReader` will only read files in the top level of the directory. To read from subdirectories, set `recursive=True`:
```python
SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
```
### Restricting the files loaded
Instead of loading every file in a directory, you can pass a list of specific file paths as `input_files`:
```python
SimpleDirectoryReader(input_files=["path/to/file1", "path/to/file2"])
```
or you can pass a list of file paths to **exclude** using `exclude`:
```python
SimpleDirectoryReader(
input_dir="path/to/directory", exclude=["path/to/file1", "path/to/file2"]
)
```
You can also set `required_exts` to a list of file extensions to only load files with those extensions:
```python
SimpleDirectoryReader(
input_dir="path/to/directory", required_exts=[".pdf", ".docx"]
)
```
And you can set a maximum number of files to be loaded with `num_files_limit`:
```python
SimpleDirectoryReader(input_dir="path/to/directory", num_files_limit=100)
```
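To preview which files a given combination of these options would pick up before loading anything, the selection rules described above can be approximated with the standard library. This is a rough sketch, not `SimpleDirectoryReader`'s actual implementation: `preview_files` is a hypothetical helper, and the real reader may differ in details such as ordering, hidden-file handling, and how `exclude` entries are matched.

```python
from pathlib import Path


def preview_files(input_dir, recursive=False, exclude=None,
                  required_exts=None, num_files_limit=None):
    """Approximate which files the options above would select (hypothetical helper)."""
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays in the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    excluded = {Path(p).resolve() for p in (exclude or [])}

    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        if path.resolve() in excluded:
            continue
        if required_exts is not None and path.suffix not in required_exts:
            continue
        selected.append(path)

    # Apply the cap last, after all other filters
    if num_files_limit is not None:
        selected = selected[:num_files_limit]
    return selected
```

Calling `preview_files("path/to/directory", recursive=True, required_exts=[".pdf"])` returns a list of `Path` objects that can be inspected, or passed on as `input_files` once converted to strings.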
### Specifying file encoding
`SimpleDirectoryReader` expects files to be `utf-8` encoded but you can override this using the `encoding` parameter:
```python
SimpleDirectoryReader(input_dir="path/to/directory", encoding="latin-1")
```
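Why the encoding matters can be seen with the standard library alone: bytes written as Latin-1 are often not valid UTF-8, so decoding them as UTF-8 fails. The sketch below uses only built-in file handling and a temporary file (the file name is illustrative); it does not call LlamaIndex.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy.txt"
    # "café" encoded as Latin-1 stores é as the single byte 0xE9
    path.write_bytes("café".encode("latin-1"))

    try:
        path.read_text(encoding="utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        # A lone 0xE9 byte is not a valid UTF-8 sequence
        utf8_ok = False

    # Decoding with the encoding the file was written in succeeds
    text = path.read_text(encoding="latin-1")

print(utf8_ok, text)
```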
### Extracting metadata
You can specify a function that will read each file and extract metadata that gets attached to the resulting `Document` object for each file by passing the function as `file_metadata`:
```python
def get_meta(file_path):
    return {"foo": "bar", "file_path": file_path}


SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_meta)
```
The function should take a single argument, the file path, and return a dictionary of metadata.
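As a slightly more realistic illustration, a metadata function might derive its values from the file itself. The sketch below is one possibility, using only the standard library; the function name `file_info` and the dictionary keys are illustrative choices, not anything LlamaIndex requires.

```python
from datetime import datetime, timezone
from pathlib import Path


def file_info(file_path):
    """Build a metadata dictionary from a file path (hypothetical example)."""
    path = Path(file_path)
    stat = path.stat()
    return {
        "file_name": path.name,
        "extension": path.suffix,
        "size_bytes": stat.st_size,
        # Last-modified time as an ISO 8601 string in UTC
        "modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
```

It can be passed in the same way as `get_meta`, that is, as `file_metadata=file_info`.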
### Extending to other file types
You can extend `SimpleDirectoryReader` to read other file types by passing a dictionary that maps file extensions to instances of `BaseReader` as `file_extractor`. A `BaseReader` should read the file and return a list of Documents. For example, to add custom support for `.myfile` files:
```python
from llama_index import SimpleDirectoryReader
from llama_index.readers.base import BaseReader
from llama_index.schema import Document
class MyFileReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            text = f.read()
        # load_data returns a list of Document objects
        return [Document(text=text + "Foobar", extra_info=extra_info or {})]

reader = SimpleDirectoryReader(
input_dir="./data", file_extractor={".myfile": MyFileReader()}
)
documents = reader.load_data()
print(documents)
```
Note that this mapping will override the default file extractors for the file types you specify, so you'll need to add them back in if you want to support them.
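One way to picture this override behaviour is as a per-extension lookup in which custom entries take precedence. The following is a conceptual sketch only: every name in it (`DEFAULT_EXTRACTORS`, `pick_extractor`, `read_plain_text`, `read_upper`) is hypothetical, plain functions stand in for reader objects, and it is not how LlamaIndex is implemented internally.

```python
from pathlib import Path


def read_plain_text(path):
    return Path(path).read_text()


def read_upper(path):
    # Stand-in for a custom reader; upper-cases the text so the effect is visible
    return Path(path).read_text().upper()


# Hypothetical default mapping; LlamaIndex's real defaults are not shown here
DEFAULT_EXTRACTORS = {".md": read_plain_text}


def pick_extractor(path, file_extractor=None):
    """Choose a reader by extension: custom entries win over defaults."""
    suffix = Path(path).suffix
    custom = file_extractor or {}
    if suffix in custom:
        return custom[suffix]
    # Fall back to the default for this extension, else treat as plain text
    return DEFAULT_EXTRACTORS.get(suffix, read_plain_text)
```

In this sketch, supplying `{".md": read_upper}` replaces the default handling for `.md` only; other extensions still resolve to their defaults.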