diff --git a/.github/ISSUE_TEMPLATE/docs-form.yml b/.github/ISSUE_TEMPLATE/docs-form.yml index c2b6c959652fbcae787d3f24c206c86dd6e7f8c3..718246c6130e9b4f0f9ba1e77d6b1353c00a422d 100644 --- a/.github/ISSUE_TEMPLATE/docs-form.yml +++ b/.github/ISSUE_TEMPLATE/docs-form.yml @@ -1,12 +1,12 @@ name: Documentation Change Request -description: Request a change to the documenation +description: Request a change to the documentation title: "[Documentation]: " labels: ["documentation", "triage"] body: - type: markdown attributes: value: | - Thanks for taking the time to fill out this documenation request! + Thanks for taking the time to fill out this documentation request! Please complete the following form to help us assist you. - type: textarea id: doc-description @@ -18,7 +18,7 @@ body: - type: input id: docs-link attributes: - label: Documenation Link - description: Please link to the section of documenation that is broken or missing information. + label: Documentation Link + description: Please link to the section of documentation that is broken or missing information. validations: required: true diff --git a/.github/ISSUE_TEMPLATE/feature-form.yml b/.github/ISSUE_TEMPLATE/feature-form.yml index 199e2bb7195534a81bdad0578562bc2cdfaa2135..238681956919124a195c26ed7604b7ddd212433f 100644 --- a/.github/ISSUE_TEMPLATE/feature-form.yml +++ b/.github/ISSUE_TEMPLATE/feature-form.yml @@ -12,14 +12,14 @@ body: id: feature-description attributes: label: Feature Description - description: Describe the feature you are requesting. Try to reference existing implemenations/papers/examples when possible. + description: Describe the feature you are requesting. Try to reference existing implementations/papers/examples when possible. validations: required: true - type: textarea id: why-feature attributes: label: Reason - description: What is stopping LlamaIndex from supporting this feature today? What existing apporaches have not worked for you? + description: What is stopping LlamaIndex from supporting this feature today? What existing approaches have not worked for you? validations: required: false - type: textarea diff --git a/CHANGELOG.md b/CHANGELOG.md index 9fcd4269348ac4e1507e1bf366cd5dccaf1f1dc0..34d68d2da33a066d86ed892d51661fbe7d7b754e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,7 @@ ### Bug Fixes / Nits - Fix `LocalAI` chat capability without `max_tokens` (#7942) +- Added `codespell` for automated checking (#7941) ## [0.8.38] - 2023-10-02 @@ -441,7 +442,7 @@ ### Bug Fixes / Nits -- Fix inifinite looping with forced function call in `OpenAIAgent` (#7363) +- Fix infinite looping with forced function call in `OpenAIAgent` (#7363) ## [0.8.6] - 2023-08-22 @@ -1045,7 +1046,7 @@ ### Breaking/Deprecated API Changes - `Node` has been renamed to `TextNode` and is imported from `llama_index.schema` (#6586) -- `TextNode` and `Document` must be instansiated with kwargs: `Document(text=text)` (#6586) +- `TextNode` and `Document` must be instantiated with kwargs: `Document(text=text)` (#6586) - `TextNode` (fka `Node`) has a `id_` or `node_id` property, rather than `doc_id` (#6586) - `TextNode` and `Document` have a metadata property, which replaces the extra_info property (#6586) - `TextNode` no longer has a `node_info` property (start/end indexes are accessed directly with `start/end_char_idx` attributes) (#6586) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 2930468d58bd1c7bb0dec54b104b36aba70b546c..b48aa2e5b06736ae522327c30e495b8b4c996235 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -101,7 +101,7 @@ See [Storage guide](https://gpt-index.readthedocs.io/en/latest/how_to/storage.ht #### Vector Stores -Our vector store classes store embeddings and support lookup via similiarity search. +Our vector store classes store embeddings and support lookup via similarity search. These serve as the main data store and retrieval engine for our vector index. **Interface**: @@ -146,7 +146,7 @@ data if you wish. **Ideas**: -- Besides the "default" retrievers built on top of each index, what about fancier retrievers? E.g. retrievers that take in other retrivers as input? Or other +- Besides the "default" retrievers built on top of each index, what about fancier retrievers? E.g. retrievers that take in other retrievers as input? Or other types of data? --- @@ -190,7 +190,7 @@ See [guide](https://gpt-index.readthedocs.io/en/latest/how_to/query/query_transf A token usage optimizer refines the retrieved `Nodes` to reduce token usage during response synthesis. -**Interface**: `optimize` takes in the `QueryBundle` and a text chunk `str`, and outputs a refined text chunk `str` that yeilds a more optimized response +**Interface**: `optimize` takes in the `QueryBundle` and a text chunk `str`, and outputs a refined text chunk `str` that yields a more optimized response **Examples**: @@ -208,7 +208,7 @@ A node postprocessor refines a list of retrieve nodes given configuration and co - [Keyword Postprocessor](https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/postprocessor/node.py#L32): filters nodes based on keyword match - [Similarity Postprocessor](https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/postprocessor/node.py#L62): filers nodes based on similarity threshold -- [Prev Next Postprocessor](https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/postprocessor/node.py#L135): fetchs additional nodes to augment context based on node relationships. +- [Prev Next Postprocessor](https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/postprocessor/node.py#L135): fetches additional nodes to augment context based on node relationships. --- diff --git a/Makefile b/Makefile index 4800823abf8d79589d7159a5e05b9686a0ec7686..50c81f94589214574e90c1b0534dd78e00c74893 100644 --- a/Makefile +++ b/Makefile @@ -5,9 +5,10 @@ help: ## Show all Makefile targets .PHONY: format lint format: ## Run code formatter: black black . -lint: ## Run linters: mypy, black, ruff +lint: ## Run linters: mypy, black, codespell, ruff mypy . black . --check + codespell . ruff check . test: ## Run tests pytest tests diff --git a/README.md b/README.md index 1732381f7bf178cd5690486b89c85228b02b81fa..a54a233acb39f6fb3525dc00d9366fa6ef211195 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Discord: https://discord.gg/dGcwcsnxhU. **NOTE**: This README is not updated as frequently as the documentation. Please check out the documentation above for the latest updates! ### Context -- LLMs are a phenomenonal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data. +- LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data. - How do we best augment LLMs with our own private data? We need a comprehensive toolkit to help perform this data augmentation for LLMs. diff --git a/llama_index/callbacks/llama_debug.py b/llama_index/callbacks/llama_debug.py index 4c83f4f35e5d33fe87edeadd99b04ecb229350a0..ce64d24e1815ddc873d4448cf5f3320687e80651 100644 --- a/llama_index/callbacks/llama_debug.py +++ b/llama_index/callbacks/llama_debug.py @@ -134,7 +134,7 @@ class LlamaDebugHandler(BaseCallbackHandler): def get_event_pairs( self, event_type: Optional[CBEventType] = None ) -> List[List[CBEvent]]: - """Pair events by ID, either all events or a sepcific type.""" + """Pair events by ID, either all events or a specific type.""" if event_type is not None: return self._get_event_pairs(self._event_pairs_by_type[event_type]) diff --git a/llama_index/embeddings/base.py b/llama_index/embeddings/base.py index 3aafb268616c9a3cf72fb646ecf67fad92943257..d33b52ca26537979b95bc02155ebcc753899b7bc 100644 --- a/llama_index/embeddings/base.py +++ b/llama_index/embeddings/base.py @@ -150,7 +150,7 @@ class BaseEmbedding(BaseComponent): """Asynchronously get text embedding. By default, this falls back to _get_text_embedding. - Meant to be overriden if there is a true async implementation. + Meant to be overridden if there is a true async implementation. """ return self._get_text_embedding(text) @@ -159,7 +159,7 @@ class BaseEmbedding(BaseComponent): """Get a list of text embeddings. By default, this is a wrapper around _get_text_embedding. - Meant to be overriden for batch queries. + Meant to be overridden for batch queries. """ result = [self._get_text_embedding(text) for text in texts] @@ -169,7 +169,7 @@ class BaseEmbedding(BaseComponent): """Async get a list of text embeddings. By default, this is a wrapper around _aget_text_embedding. - Meant to be overriden for batch queries. + Meant to be overridden for batch queries. """ result = await asyncio.gather( diff --git a/llama_index/embeddings/openai.py b/llama_index/embeddings/openai.py index 996b2df171517035b4e89f2c66b5e8969b22846f..4a99713244d3f930aed1cc1ec5f68c2f99ae7e85 100644 --- a/llama_index/embeddings/openai.py +++ b/llama_index/embeddings/openai.py @@ -355,7 +355,7 @@ class OpenAIEmbedding(BaseEmbedding): """Get text embeddings. By default, this is a wrapper around _get_text_embedding. - Can be overriden for batch queries. + Can be overridden for batch queries. """ return get_embeddings( diff --git a/llama_index/finetuning/openai/validate_json.py b/llama_index/finetuning/openai/validate_json.py index beaff1396d9b956a30dd4c18994623158bfbb107..bb2daaecc6cd382fa073ba9a62bfbbb1ce723c0b 100644 --- a/llama_index/finetuning/openai/validate_json.py +++ b/llama_index/finetuning/openai/validate_json.py @@ -164,7 +164,7 @@ def validate_json(data_path: str) -> None: f"~{n_epochs * n_billing_tokens_in_dataset} tokens" ) - print("As of Augest 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.") + print("As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.") print( "This means your total cost for training will be " f"${n_billing_tokens_in_dataset * 0.008 / 1000} per epoch." diff --git a/llama_index/graph_stores/nebulagraph.py b/llama_index/graph_stores/nebulagraph.py index 0f890649b0a3a0c73b0f596fce909ef2769d183f..a02da7d3a84732efbaf227bad241f3f9bae5cf6f 100644 --- a/llama_index/graph_stores/nebulagraph.py +++ b/llama_index/graph_stores/nebulagraph.py @@ -282,7 +282,7 @@ class NebulaGraphStore(GraphStore): logger.error( f"Connection issue, try to recreate session pool. Query: {query}, " f"Param: {param_map}" - f"Erorr: {e}" + f"Error: {e}" ) self.init_session_pool() logger.info( diff --git a/llama_index/graph_stores/registery.py b/llama_index/graph_stores/registry.py similarity index 100% rename from llama_index/graph_stores/registery.py rename to llama_index/graph_stores/registry.py diff --git a/llama_index/indices/base_retriever.py b/llama_index/indices/base_retriever.py index fe5a1dec11555d4942f8b1b5ca7e93f39c3d2cf1..b113f4d00f48eef5c0edd7f8599b8ac798cff321 100644 --- a/llama_index/indices/base_retriever.py +++ b/llama_index/indices/base_retriever.py @@ -39,7 +39,7 @@ class BaseRetriever(ABC): # TODO: make this abstract # @abstractmethod async def _aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]: - """Asyncronously retrieve nodes given query. + """Asynchronously retrieve nodes given query. Implemented by the user. diff --git a/llama_index/indices/keyword_table/base.py b/llama_index/indices/keyword_table/base.py index 2655b69c0de7f57a0fefa2408b27d35dc36e9917..808ea1a9a7c1ef4fbe159bdffc828d418b3393b3 100644 --- a/llama_index/indices/keyword_table/base.py +++ b/llama_index/indices/keyword_table/base.py @@ -40,7 +40,7 @@ class BaseKeywordTableIndex(BaseIndex[KeywordTable]): """Base Keyword Table Index. This index extracts keywords from the text, and maps each - keyword to the node(s) that it corresponds to. In this sense it mimicks a + keyword to the node(s) that it corresponds to. In this sense it mimics a "hash table". During index construction, the keyword table is constructed by extracting keywords from each node and creating an internal mapping. diff --git a/llama_index/indices/knowledge_graph/retrievers.py b/llama_index/indices/knowledge_graph/retrievers.py index f799ad0eb3ef7b6e29bf751b6ef6486f7ae765dc..d962a2f3afd5795405ec190af0a07b623e695138 100644 --- a/llama_index/indices/knowledge_graph/retrievers.py +++ b/llama_index/indices/knowledge_graph/retrievers.py @@ -59,7 +59,7 @@ class KGTableRetriever(BaseRetriever): num_chunks_per_query (int): Maximum number of text chunks to query. include_text (bool): Use the document text source from each relevant triplet during queries. - retriever_mode (KGRetrieverMode): Specifies whether to use keyowrds, + retriever_mode (KGRetrieverMode): Specifies whether to use keywords, embeddings, or both to find relevant triplets. Should be one of "keyword", "embedding", or "hybrid". similarity_top_k (int): The number of top embeddings to use @@ -144,7 +144,7 @@ class KGTableRetriever(BaseRetriever): node_visited = set() keywords = self._get_keywords(query_bundle.query_str) if self._verbose: - print_text(f"Extraced keywords: {keywords}\n", color="green") + print_text(f"Extracted keywords: {keywords}\n", color="green") rel_texts = [] cur_rel_map = {} chunk_indices_count: Dict[str, int] = defaultdict(int) @@ -241,7 +241,7 @@ class KGTableRetriever(BaseRetriever): rel_texts[j] = "" rel_texts = [rel_text for rel_text in rel_texts if rel_text != ""] - # tuncate rel_texts + # truncate rel_texts rel_texts = rel_texts[: self.max_knowledge_sequence] sorted_chunk_indices = sorted( @@ -709,7 +709,7 @@ class KnowledgeGraphRAGRetriever(BaseRetriever): return [] # Get entities entities = self._get_entities(query_bundle.query_str) - # Before we enable embedding/symantic search, we need to make sure + # Before we enable embedding/semantic search, we need to make sure # we don't miss any entities that's synoynm of the entities we extracted # in string matching based retrieval in following steps, thus we expand # synonyms here. @@ -730,7 +730,7 @@ class KnowledgeGraphRAGRetriever(BaseRetriever): return [] # Get entities entities = await self._aget_entities(query_bundle.query_str) - # Before we enable embedding/symantic search, we need to make sure + # Before we enable embedding/semantic search, we need to make sure # we don't miss any entities that's synoynm of the entities we extracted # in string matching based retrieval in following steps, thus we expand # synonyms here. diff --git a/llama_index/indices/postprocessor/optimizer.py b/llama_index/indices/postprocessor/optimizer.py index 90e975a3dff9dde5652689c1fc3d8e3d6b248ef5..33ebe3d887c4e2fb0c0f87d653b4147727e6644d 100644 --- a/llama_index/indices/postprocessor/optimizer.py +++ b/llama_index/indices/postprocessor/optimizer.py @@ -21,7 +21,7 @@ class SentenceEmbeddingOptimizer(BaseNodePostprocessor): description="Percentile cutoff for the top k sentences to use." ) threshold_cutoff: Optional[float] = Field( - description="Threshold cutoff for similiarity for each sentence to use." + description="Threshold cutoff for similarity for each sentence to use." ) _embed_model: BaseEmbedding = PrivateAttr() diff --git a/llama_index/llms/anthropic.py b/llama_index/llms/anthropic.py index 157d9bfdb9e07ff90239ea858d32dd13775ebfdb..5005e05719d18eb22fcf950fc36c3554afd6dc32 100644 --- a/llama_index/llms/anthropic.py +++ b/llama_index/llms/anthropic.py @@ -42,7 +42,7 @@ class Anthropic(LLM): default=10, description="The maximum number of API retries." ) additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the anthropic API." + default_factory=dict, description="Additional kwargs for the anthropic API." ) _client: Any = PrivateAttr() diff --git a/llama_index/llms/konko.py b/llama_index/llms/konko.py index f842ba1a184ffefde4ae5f7ffc36357b02a05ccc..43fbcd99590e8d23df45335fc03af4e397d0dd68 100644 --- a/llama_index/llms/konko.py +++ b/llama_index/llms/konko.py @@ -41,12 +41,12 @@ class Konko(LLM): class_type = "konko" model: str = Field(description="The konko model to use.") - temperature: float = Field(description="The tempature to use during generation.") + temperature: float = Field(description="The temperature to use during generation.") max_tokens: Optional[int] = Field( description="The maximum number of tokens to generate." ) additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the konko API." + default_factory=dict, description="Additional kwargs for the konko API." ) max_retries: int = Field(description="The maximum number of API retries.") diff --git a/llama_index/llms/litellm.py b/llama_index/llms/litellm.py index 22531b471687bdaadbed37255c3c139b376d2d50..ba85f9037e1434b75e00c6689837de40a145c24c 100644 --- a/llama_index/llms/litellm.py +++ b/llama_index/llms/litellm.py @@ -44,13 +44,13 @@ class LiteLLM(LLM): model: str = Field( description="The LiteLLM model to use." ) # For complete list of providers https://docs.litellm.ai/docs/providers - temperature: float = Field(description="The tempature to use during generation.") + temperature: float = Field(description="The temperature to use during generation.") max_tokens: Optional[int] = Field( description="The maximum number of tokens to generate." ) additional_kwargs: Dict[str, Any] = Field( default_factory=dict, - description="Additonal kwargs for the LLM API.", + description="Additional kwargs for the LLM API.", # for all inputs https://docs.litellm.ai/docs/completion/input ) max_retries: int = Field(description="The maximum number of API retries.") diff --git a/llama_index/llms/llama_api.py b/llama_index/llms/llama_api.py index 1d3721031f5dcbcd4c2f3192e8a0e61b0f34e810..6d0d3ba5152b8f865adbb8dec595537d260ae520 100644 --- a/llama_index/llms/llama_api.py +++ b/llama_index/llms/llama_api.py @@ -27,7 +27,7 @@ class LlamaAPI(CustomLLM): temperature: float = Field(description="The temperature to use for sampling.") max_tokens: int = Field(description="The maximum number of tokens to generate.") additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the llama-api API." + default_factory=dict, description="Additional kwargs for the llama-api API." ) _client: Any = PrivateAttr() diff --git a/llama_index/llms/ollama.py b/llama_index/llms/ollama.py index ab09b1da339cdcabc35a96d7dccea962ca41886f..6b87492b62b30adc3de7078d87e36b337fd3ad8e 100644 --- a/llama_index/llms/ollama.py +++ b/llama_index/llms/ollama.py @@ -31,7 +31,7 @@ class Ollama(CustomLLM): ) prompt_key: str = Field(description="The key to use for the prompt in API calls.") additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the Ollama API." + default_factory=dict, description="Additional kwargs for the Ollama API." ) _messages_to_prompt: Callable = PrivateAttr() diff --git a/llama_index/llms/openai.py b/llama_index/llms/openai.py index c624520fd8b20f00d7dce7ae8086d58129c410d7..545e75ecb0aa80b01d2b71f82600fe301faf349b 100644 --- a/llama_index/llms/openai.py +++ b/llama_index/llms/openai.py @@ -42,12 +42,12 @@ class OpenAI(LLM): class_type = "openai" model: str = Field(description="The OpenAI model to use.") - temperature: float = Field(description="The tempature to use during generation.") + temperature: float = Field(description="The temperature to use during generation.") max_tokens: Optional[int] = Field( description="The maximum number of tokens to generate." ) additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the OpenAI API." + default_factory=dict, description="Additional kwargs for the OpenAI API." ) max_retries: int = Field(description="The maximum number of API retries.") diff --git a/llama_index/llms/portkey.py b/llama_index/llms/portkey.py index a934c8f79dc62a99225fece48dbffcd5914fc210..ff85f717f8cd9f0c6e354947cc464bcef9c10311 100644 --- a/llama_index/llms/portkey.py +++ b/llama_index/llms/portkey.py @@ -1,5 +1,5 @@ """ - Portkey intergation with Llama_index for enchanced monitoring + Portkey integration with Llama_index for enhanced monitoring """ from typing import Any, Optional, Sequence, Union, List, TYPE_CHECKING, cast diff --git a/llama_index/llms/replicate.py b/llama_index/llms/replicate.py index 3746bb50643a93bfe4c84bade5b5ff77b2dc3cda..61ee4d78fed2741520dd10fdf4e4760816acc246 100644 --- a/llama_index/llms/replicate.py +++ b/llama_index/llms/replicate.py @@ -30,7 +30,7 @@ class Replicate(CustomLLM): ) prompt_key: str = Field(description="The key to use for the prompt in API calls.") additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the Replicate API." + default_factory=dict, description="Additional kwargs for the Replicate API." ) _messages_to_prompt: Callable = PrivateAttr() diff --git a/llama_index/llms/rungpt.py b/llama_index/llms/rungpt.py index 776da8d3699c6b818ce42b7d3d14e11d2f8403ef..dc3253c15de50a55ad6cc4aaab17d7db776e1124 100644 --- a/llama_index/llms/rungpt.py +++ b/llama_index/llms/rungpt.py @@ -33,7 +33,7 @@ class RunGptLLM(LLM): description="The maximum number of context tokens for the model." ) additional_kwargs: Dict[str, Any] = Field( - default_factory=dict, description="Additonal kwargs for the Replicate API." + default_factory=dict, description="Additional kwargs for the Replicate API." ) base_url: str = Field( description="The address of your target model served by rungpt." diff --git a/llama_index/memory/types.py b/llama_index/memory/types.py index b887803b79e8b92f311192aecfceae48c58e6113..a83dee03c08e4b289801c8323eb3b33c7f6f5950 100644 --- a/llama_index/memory/types.py +++ b/llama_index/memory/types.py @@ -19,7 +19,7 @@ class BaseMemory(BaseModel): chat_history: Optional[List[ChatMessage]] = None, llm: Optional[LLM] = None, ) -> "BaseMemory": - """Create a chat memory from defualts.""" + """Create a chat memory from defaults.""" pass @abstractmethod diff --git a/llama_index/node_parser/extractors/metadata_extractors.py b/llama_index/node_parser/extractors/metadata_extractors.py index a90fb1648250910a424e2fd9328fded4574cdf1f..d68296e7efe3472dd0c90ed45f831dbb08f7822e 100644 --- a/llama_index/node_parser/extractors/metadata_extractors.py +++ b/llama_index/node_parser/extractors/metadata_extractors.py @@ -521,7 +521,7 @@ class EntityExtractor(MetadataFeatureExtractor): prediction_threshold: float = Field( default=0.5, description="The confidence threshold for accepting predictions." ) - span_joiner: str = Field(description="The seperator beween entity names.") + span_joiner: str = Field(description="The separator between entity names.") label_entities: bool = Field( default=False, description="Include entity class labels or not." ) diff --git a/llama_index/query_engine/knowledge_graph_query_engine.py b/llama_index/query_engine/knowledge_graph_query_engine.py index f02bb8924a17957e434dae7aec8ca660202be682..4ea934908f61136fa435076bf5c8cbf8baf5485c 100644 --- a/llama_index/query_engine/knowledge_graph_query_engine.py +++ b/llama_index/query_engine/knowledge_graph_query_engine.py @@ -4,7 +4,7 @@ import logging from typing import Any, List, Optional, Sequence from llama_index.callbacks.schema import CBEventType, EventPayload -from llama_index.graph_stores.registery import ( +from llama_index.graph_stores.registry import ( GRAPH_STORE_CLASS_TO_GRAPH_STORE_TYPE, GraphStoreType, ) diff --git a/llama_index/readers/chroma.py b/llama_index/readers/chroma.py index 9c7037abea155a816f059cf52f728f84174a506b..5f42857d5fe82a58cd14be28204d54cfdda54ae0 100644 --- a/llama_index/readers/chroma.py +++ b/llama_index/readers/chroma.py @@ -12,7 +12,7 @@ class ChromaReader(BaseReader): Retrieve documents from existing persisted Chroma collections. Args: - collection_name: Name of the peristed collection. + collection_name: Name of the persisted collection. persist_directory: Directory where the collection is persisted. """ diff --git a/llama_index/readers/deeplake.py b/llama_index/readers/deeplake.py index 52dddbd3da87fb7b7f184fbc8ec77c6f124421fc..d2e79ff0046a3f620e6bfa7cdb9182cedd08a14e 100644 --- a/llama_index/readers/deeplake.py +++ b/llama_index/readers/deeplake.py @@ -27,7 +27,7 @@ def vector_search( data_vectors: np.ndarray limit (int): number of nearest neighbors distance_metric: distance function 'L2' for Euclidean, 'L1' for Nuclear, 'Max' - l-infinity distnace, 'cos' for cosine similarity, 'dot' for dot product + l-infinity distance, 'cos' for cosine similarity, 'dot' for dot product returns: nearest_indices: List, indices of nearest neighbors """ @@ -81,7 +81,7 @@ class DeepLakeReader(BaseReader): """Load data from DeepLake. Args: - dataset_name (str): Name of the DeepLake dataet. + dataset_name (str): Name of the DeepLake dataset. query_vector (List[float]): Query vector. limit (int): Number of results to return. diff --git a/llama_index/readers/github_readers/github_repository_reader.py b/llama_index/readers/github_readers/github_repository_reader.py index f05d6444c89d497417b54be945565d519d82c353..ef3265e7b0a1819752a01ccaa7f6eb67112b204f 100644 --- a/llama_index/readers/github_readers/github_repository_reader.py +++ b/llama_index/readers/github_readers/github_repository_reader.py @@ -201,7 +201,7 @@ class GithubRepositoryReader(BaseReader): :param `current_path`: current path of the tree :param `current_depth`: current depth of the tree :return: list of tuples of - (tree object, file's full path realtive to the root of the repo) + (tree object, file's full path relative to the root of the repo) """ blobs_and_full_paths: List[Tuple[GitTreeResponseModel.GitTreeObject, str]] = [] print_if_verbose( @@ -255,7 +255,7 @@ class GithubRepositoryReader(BaseReader): Generate documents from a list of blobs and their full paths. :param `blobs_and_paths`: list of tuples of - (tree object, file's full path in the repo realtive to the root of the repo) + (tree object, file's full path in the repo relative to the root of the repo) :return: list of documents """ buffered_iterator = BufferedGitBlobDataIterator( diff --git a/llama_index/retrievers/auto_merging_retriever.py b/llama_index/retrievers/auto_merging_retriever.py index d43d85ca4f51071a0480aec922d29970e121e3b4..437b66076cf3e433421c7178c5d63f558e9692c7 100644 --- a/llama_index/retrievers/auto_merging_retriever.py +++ b/llama_index/retrievers/auto_merging_retriever.py @@ -18,7 +18,7 @@ logger = logging.getLogger(__name__) class AutoMergingRetriever(BaseRetriever): """This retriever will try to merge context into parent context. - The retreiver first retrieves chunks from a vector store. + The retriever first retrieves chunks from a vector store. Then, it will try to merge the chunks into a single context. """ diff --git a/llama_index/schema.py b/llama_index/schema.py index db4a2ac9a4d3a706f95f60d17f7c53968bd9285d..e4b17cedb9998032a5a374bd27918653a909dd6d 100644 --- a/llama_index/schema.py +++ b/llama_index/schema.py @@ -20,7 +20,7 @@ WRAP_WIDTH = 70 class BaseComponent(BaseModel): - """Base component object to caputure class names.""" + """Base component object to capture class names.""" @classmethod @abstractmethod @@ -135,11 +135,11 @@ class BaseNode(BaseComponent): ) excluded_embed_metadata_keys: List[str] = Field( default_factory=list, - description="Metadata keys that are exluded from text for the embed model.", + description="Metadata keys that are excluded from text for the embed model.", ) excluded_llm_metadata_keys: List[str] = Field( default_factory=list, - description="Metadata keys that are exluded from text for the LLM.", + description="Metadata keys that are excluded from text for the LLM.", ) relationships: Dict[NodeRelationship, RelatedNodeType] = Field( default_factory=dict, @@ -294,7 +294,7 @@ class TextNode(BaseNode): ) metadata_seperator: str = Field( default="\n", - description="Seperator between metadata fields when converting to string.", + description="Separator between metadata fields when converting to string.", ) @classmethod diff --git a/llama_index/storage/kvstore/firestore_kvstore.py b/llama_index/storage/kvstore/firestore_kvstore.py index d02991ba5c554b6336b147a15bccdad5ae9056a8..aec773dd98a38a5a28e50f01f7f163c194eae225 100644 --- a/llama_index/storage/kvstore/firestore_kvstore.py +++ b/llama_index/storage/kvstore/firestore_kvstore.py @@ -1,7 +1,7 @@ from typing import Any, Dict, Optional from llama_index.storage.kvstore.types import DEFAULT_COLLECTION, BaseKVStore -# keyword "_" is reserved in Firestore but refered in llama_index/constants.py. +# keyword "_" is reserved in Firestore but referred in llama_index/constants.py. FIELD_NAME_REPLACE_SET = {"__data__": "data", "__type__": "type"} FIELD_NAME_REPLACE_GET = {"data": "__data__", "type": "__type__"} diff --git a/llama_index/text_splitter/code_splitter.py b/llama_index/text_splitter/code_splitter.py index 7d91d2ce080de67e8a0a7ce31e277175aaa9d092..850b60e60b2f50e7df45dae7ff4497bd21d2f039 100644 --- a/llama_index/text_splitter/code_splitter.py +++ b/llama_index/text_splitter/code_splitter.py @@ -20,7 +20,7 @@ class CodeSplitter(TextSplitter): """ language: str = Field( - description="The programming languge of the code being split." + description="The programming language of the code being split." ) chunk_lines: int = Field( default=DEFAULT_CHUNK_LINES, diff --git a/llama_index/tools/tool_spec/load_and_search/README.md b/llama_index/tools/tool_spec/load_and_search/README.md index 0db59c3de5b0665800029cbc13402bef6318c1b4..155e9de13c5f1611cbffbd7b1db7ea856810b803 100644 --- a/llama_index/tools/tool_spec/load_and_search/README.md +++ b/llama_index/tools/tool_spec/load_and_search/README.md @@ -1,6 +1,6 @@ # LoadAndSearch Tool -This Tool Spec is intended to wrap other tools, allowing the Agent to perform seperate loading and reading of data. This is very useful for when tools return information larger than or closer to the size of the context window. +This Tool Spec is intended to wrap other tools, allowing the Agent to perform separate loading and reading of data. This is very useful for when tools return information larger than or closer to the size of the context window. ## Usage @@ -17,7 +17,7 @@ wiki_spec = WikipediaToolSpec() # Get the search_data tool from the wikipedia tool spec tool = wiki_spec.to_tool_list()[1] -# Wrap the tool, spliting into a loader and a reader +# Wrap the tool, splitting into a loader and a reader agent = OpenAIAgent.from_tools( LoadAndSearchToolSpec.from_defaults( tool diff --git a/llama_index/utils.py b/llama_index/utils.py index 45ea56b559f71f51ce7b5793fc79c6ddb8e29932..e52e00eb5cade1ee628f36d323b336a0b89f9269 100644 --- a/llama_index/utils.py +++ b/llama_index/utils.py @@ -303,7 +303,7 @@ def add_sync_version(func: Any) -> Any: # Sample text from llama_index's readme SAMPLE_TEXT = """ Context -LLMs are a phenomenonal piece of technology for knowledge generation and reasoning. +LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data. How do we best augment LLMs with our own private data? We need a comprehensive toolkit to help perform this data augmentation for LLMs. diff --git a/llama_index/vector_stores/deeplake.py b/llama_index/vector_stores/deeplake.py index d29d1bd519e5d03aeecce07b110a0b834a8f1756..a122990599f4f63e016aacb09ffd595c09d3a012 100644 --- a/llama_index/vector_stores/deeplake.py +++ b/llama_index/vector_stores/deeplake.py @@ -33,7 +33,7 @@ class DeepLakeVectorStore(VectorStoreBase): In this vector store we store the text, its embedding and a few pieces of its metadata in a deeplake dataset. This implemnetation allows the use of an already existing deeplake dataset if it is one that was created - this vector store. It also supports creating a new one if the dataset doesnt + this vector store. It also supports creating a new one if the dataset doesn't exist or if `overwrite` is set to True. """ diff --git a/llama_index/vector_stores/docarray/hnsw.py b/llama_index/vector_stores/docarray/hnsw.py index 61438e6072f0975096ca9f75c6279b462de9ec0a..eccc791bbe60d32b2b407ea4fa9d2d0d74c03100 100644 --- a/llama_index/vector_stores/docarray/hnsw.py +++ b/llama_index/vector_stores/docarray/hnsw.py @@ -36,7 +36,7 @@ class DocArrayHnswVectorStore(DocArrayVectorStore): ef_construction (int, optional): defines a construction time/accuracy trade-off. Default is 200. ef (int, optional): The size of the dynamic candidate list. Default is 10. - M (int, optional): defines tha maximum number of outgoing connections + M (int, optional): defines the maximum number of outgoing connections in the graph. Default is 16. allow_replace_deleted (bool, optional): Whether to allow replacing deleted elements. Default is True. diff --git a/llama_index/vector_stores/epsilla.py b/llama_index/vector_stores/epsilla.py index a16e489cc3c677680111f6882d79792f3207c362..fd7b1805cc5a9906d16749c825c209b818ee7444 100644 --- a/llama_index/vector_stores/epsilla.py +++ b/llama_index/vector_stores/epsilla.py @@ -158,7 +158,7 @@ class EpsillaVectorStore(VectorStore): Returns: List[str]: List of ids inserted. """ - # If the collection doesnt exist yet, create the collection + # If the collection doesn't exist yet, create the collection if not self._collection_created and len(nodes) > 0: dimension = len(nodes[0].get_embedding()) self._create_collection(dimension) diff --git a/llama_index/vector_stores/milvus.py b/llama_index/vector_stores/milvus.py index 528bdf7c8c4c701c4f08ebdb95ef115c3b9cda62..7ad657a587179aa343dafc9ba5daa579c5ce7b70 100644 --- a/llama_index/vector_stores/milvus.py +++ b/llama_index/vector_stores/milvus.py @@ -43,7 +43,7 @@ class MilvusVectorStore(VectorStore): In this vector store we store the text, its embedding and a its metadata in a Milvus collection. This implementation allows the use of an already existing collection. - It also supports creating a new one if the collection doesnt + It also supports creating a new one if the collection doesn't exist or if `overwrite` is set to True. Args: @@ -54,7 +54,7 @@ class MilvusVectorStore(VectorStore): collection_name (str, optional): The name of the collection where data will be stored. Defaults to "llamalection". dim (int, optional): The dimension of the embedding vectors for the collection. - Required if creating a new colletion. + Required if creating a new collection. embedding_field (str, optional): The name of the embedding field for the collection, defaults to DEFAULT_EMBEDDING_KEY. doc_id_field (str, optional): The name of the doc_id field for the collection, @@ -252,7 +252,7 @@ class MilvusVectorStore(VectorStore): if len(expr) != 0: string_expr = " and ".join(expr) - # Perfom the search + # Perform the search res = self.milvusclient.search( collection_name=self.collection_name, data=[query.query_embedding], diff --git a/llama_index/vector_stores/rocksetdb.py b/llama_index/vector_stores/rocksetdb.py index ab3e47f93f1ab7587820b33406532f511c380d1e..dc0295814e76f8f0adadb81bd0ffaed5d1eafa67 100644 --- a/llama_index/vector_stores/rocksetdb.py +++ b/llama_index/vector_stores/rocksetdb.py @@ -251,7 +251,7 @@ class RocksetVectorStore(VectorStore): collection will do no vector enforcement. collection (str): The name of the collection to be created client (Optional[Any]): Rockset client object - workspace (str): The workspace containing the colleciton to be + workspace (str): The workspace containing the collection to be created (default: "commons") text_key (str): The key to the text of nodes (default: llama_index.vector_stores.utils.DEFAULT_TEXT_KEY) diff --git a/llama_index/vector_stores/timescalevector.py b/llama_index/vector_stores/timescalevector.py index 09d4ecddd3517c1fe0dc23a9278ce7d44d23aefb..b5573975fc8c6dcda43e07f966be0cca96e9e54d 100644 --- a/llama_index/vector_stores/timescalevector.py +++ b/llama_index/vector_stores/timescalevector.py @@ -68,7 +68,7 @@ class TimescaleVectorStore(VectorStore): def _create_clients(self) -> None: from timescale_vector import client - # in the normal case doen't restrict the id type to even uuid. + # in the normal case doesn't restrict the id type to even uuid. # Allow arbitrary text id_type = "TEXT" if self.time_partition_interval is not None: diff --git a/llama_index/vector_stores/utils.py b/llama_index/vector_stores/utils.py index cf029d5fef73b906000a8864240d69f34f7ff257..776976a3b44af48dd6717a34a7550e83ec69095d 100644 --- a/llama_index/vector_stores/utils.py +++ b/llama_index/vector_stores/utils.py @@ -42,7 +42,7 @@ def node_to_metadata_dict( # remove embedding from node_dict node_dict["embedding"] = None - # dump remainer of node_dict to json string + # dump remainder of node_dict to json string metadata["_node_content"] = json.dumps(node_dict) # store ref doc id at top level to allow metadata filtering diff --git a/pyproject.toml b/pyproject.toml index f751c1604a1444be3639acf1e3ebc6c715dfddd4..7e9401aaa03db1d116c36a970feb737eee1297c4 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -14,3 +14,12 @@ exclude = [ "notebooks", ".git" ] + +[tool.codespell] +check-filenames = true +check-hidden = true +ignore-words-list = "ot" +# Remove .git and venv skips when integrated with pre-commit +# Feel free to un-skip docs, examples, and experimental, you will just need to +# work through many typos (--write-changes and --interactive will help) +skip = "./.git,./*venv,./docs,./examples,./experimental,*.csv,*.html,*.json,*.jsonl,*.pdf,*.txt" diff --git a/requirements.txt b/requirements.txt index 7c25b18f5af4f4780ee6628260925b10be173b74..09b6c0e42e7ce8962d7d9ea9ac9248b78b93516a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -17,6 +17,7 @@ types-setuptools==67.1.0.0 # linting black[jupyter]==23.7.0 +codespell[toml]>=v2.2.6 mypy==0.991 pre-commit==3.2.0 pylint==2.15.10 diff --git a/tests/indices/postprocessor/test_base.py b/tests/indices/postprocessor/test_base.py index 519f51ee13a17be016c9ce95fe1b5c98af62de11..e8d80b22d27ed1039e88f1397eaef4c9293be816 100644 --- a/tests/indices/postprocessor/test_base.py +++ b/tests/indices/postprocessor/test_base.py @@ -280,7 +280,7 @@ def test_time_weighted_postprocessor() -> None: assert cast(Dict, nodes[3].metadata)[key] != 3 # low time decay - # artifically make earlier nodes more relevant + # artificially make earlier nodes more relevant # therefore postprocessor should still rank earlier nodes higher nodes = [ TextNode(text="Hello world.", id_="1", metadata={key: 0}), diff --git a/tests/indices/query/test_compose_vector.py b/tests/indices/query/test_compose_vector.py index 53e0b2c04b0cb64708b5f9c799eb1208af98b2a7..f6d89bd811b0c1d24c3648af19657bfb587bdfbe 100644 --- a/tests/indices/query/test_compose_vector.py +++ b/tests/indices/query/test_compose_vector.py @@ -165,7 +165,7 @@ def test_recursive_query_vector_table_query_configs( """ vector_kwargs = index_kwargs["vector"] table_kwargs = index_kwargs["table"] - # try building a tre for a group of 4, then a list + # try building a tree for a group of 4, then a list # use a diff set of documents # try building a list for every two, then a tree vector1 = VectorStoreIndex.from_documents( diff --git a/tests/indices/response/test_response_builder.py b/tests/indices/response/test_response_builder.py index 7f5223a5f3dc2f8e7d1121a20150ee61565ca0ec..316a4ceb659489f82d19dccd5eec88659b8b535e 100644 --- a/tests/indices/response/test_response_builder.py +++ b/tests/indices/response/test_response_builder.py @@ -56,7 +56,7 @@ def test_give_response( def test_compact_response(mock_service_context: ServiceContext) -> None: """Test give response.""" # test response with ResponseMode.COMPACT - # NOTE: here we want to guarante that prompts have 0 extra tokens + # NOTE: here we want to guarantee that prompts have 0 extra tokens mock_refine_prompt_tmpl = "{query_str}{existing_answer}{context_msg}" mock_refine_prompt = PromptTemplate( mock_refine_prompt_tmpl, prompt_type=PromptType.REFINE @@ -108,7 +108,7 @@ def test_accumulate_response( ) -> None: """Test accumulate response.""" # test response with ResponseMode.ACCUMULATE - # NOTE: here we want to guarante that prompts have 0 extra tokens + # NOTE: here we want to guarantee that prompts have 0 extra tokens mock_qa_prompt_tmpl = "{context_str}{query_str}" mock_qa_prompt = PromptTemplate( mock_qa_prompt_tmpl, prompt_type=PromptType.QUESTION_ANSWER @@ -167,7 +167,7 @@ def test_accumulate_response_async( ) -> None: """Test accumulate response.""" # test response with ResponseMode.ACCUMULATE - # NOTE: here we want to guarante that prompts have 0 extra tokens + # NOTE: here we want to guarantee that prompts have 0 extra tokens mock_qa_prompt_tmpl = "{context_str}{query_str}" mock_qa_prompt = PromptTemplate( mock_qa_prompt_tmpl, prompt_type=PromptType.QUESTION_ANSWER @@ -227,7 +227,7 @@ def test_accumulate_response_aget( ) -> None: """Test accumulate response.""" # test response with ResponseMode.ACCUMULATE - # NOTE: here we want to guarante that prompts have 0 extra tokens + # NOTE: here we want to guarantee that prompts have 0 extra tokens mock_qa_prompt_tmpl = "{context_str}{query_str}" mock_qa_prompt = PromptTemplate( mock_qa_prompt_tmpl, prompt_type=PromptType.QUESTION_ANSWER @@ -289,7 +289,7 @@ def test_accumulate_response_aget( def test_accumulate_compact_response(patch_llm_predictor: None) -> None: """Test accumulate response.""" # test response with ResponseMode.ACCUMULATE - # NOTE: here we want to guarante that prompts have 0 extra tokens + # NOTE: here we want to guarantee that prompts have 0 extra tokens mock_qa_prompt_tmpl = "{context_str}{query_str}" mock_qa_prompt = PromptTemplate( mock_qa_prompt_tmpl, prompt_type=PromptType.QUESTION_ANSWER diff --git a/tests/response_synthesizers/test_refine.py b/tests/response_synthesizers/test_refine.py index b96e38d742c95aba9cea411a5cd984def28b8064..535daa21d7d955500a132f1ef0ae47cef37f3eb1 100644 --- a/tests/response_synthesizers/test_refine.py +++ b/tests/response_synthesizers/test_refine.py @@ -72,14 +72,14 @@ def refine_instance(mock_refine_service_context: ServiceContext) -> Refine: def test_constructor_args(mock_refine_service_context: ServiceContext) -> None: with pytest.raises(ValueError): - # cant construct refine with both streaming and answer filtering + # can't construct refine with both streaming and answer filtering Refine( service_context=mock_refine_service_context, streaming=True, structured_answer_filtering=True, ) with pytest.raises(ValueError): - # cant construct refine with a program factory but not answer filtering + # can't construct refine with a program factory but not answer filtering Refine( service_context=mock_refine_service_context, program_factory=lambda _: MockRefineProgram({}), diff --git a/tests/text_splitter/test_sentence_splitter.py b/tests/text_splitter/test_sentence_splitter.py index c3359538c80581e3a3bad9bb755d6a3f82180fcf..507dfa8624aa9b10503dd06a259baa7c49f2dcaf 100644 --- a/tests/text_splitter/test_sentence_splitter.py +++ b/tests/text_splitter/test_sentence_splitter.py @@ -60,7 +60,7 @@ def test_split_with_metadata(english_text: str) -> None: def test_edge_case() -> None: """Test case from: https://github.com/jerryjliu/llama_index/issues/7287""" - text = "\n\nMarch 2020\n\nL&D Metric (Org) - 2.92%\n\n| Training Name | Catergory | Duration (hrs) | Invitees | Attendance | Target Training Hours | Actual Training Hours | Adoption % |\n| ---------------------------------------------------------------------------------------------------------------------- | --------------- | -------------- | -------- | ---------- | --------------------- | --------------------- | ---------- |\n| Overview of Data Analytics | Technical | 1 | 23 | 10 | 23 | 10 | 43.5 |\n| Sales & Learning Best Practices - Introduction to OTT Platforms | Technical | 0.5 | 16 | 12 | 8 | 6 | 75 |\n| Leading Through OKRs | Lifeskill | 1 | 1 | 1 | 1 | 1 | 100 |\n| COVID: Lockdown Awareness Session | Lifeskill | 2 | 1 | 1 | 2 | 2 | 100 |\n| Navgati Interview | Lifeskill | 2 | 6 | 6 | 12 | 12 | 100 |\n| leadership Summit | Leadership | 18 | 42 | 42 | 756 | 756 | 100 |\n| AWS - AI/ML - Online Conference | Project Related | 15 | 2 | 2 | 30 | 30 | 100 |\n" # noqa + text = "\n\nMarch 2020\n\nL&D Metric (Org) - 2.92%\n\n| Training Name | Category | Duration (hrs) | Invitees | Attendance | Target Training Hours | Actual Training Hours | Adoption % |\n| ---------------------------------------------------------------------------------------------------------------------- | --------------- | -------------- | -------- | ---------- | --------------------- | --------------------- | ---------- |\n| Overview of Data Analytics | Technical | 1 | 23 | 10 | 23 | 10 | 43.5 |\n| Sales & Learning Best Practices - Introduction to OTT Platforms | Technical | 0.5 | 16 | 12 | 8 | 6 | 75 |\n| Leading Through OKRs | Lifeskill | 1 | 1 | 1 | 1 | 1 | 100 |\n| COVID: Lockdown Awareness Session | Lifeskill | 2 | 1 | 1 | 2 | 2 | 100 |\n| Navgati Interview | Lifeskill | 2 | 6 | 6 | 12 | 12 | 100 |\n| leadership Summit | Leadership | 18 | 42 | 42 | 756 | 756 | 100 |\n| AWS - AI/ML - Online Conference | Project Related | 15 | 2 | 2 | 30 | 30 | 100 |\n" # noqa splitter = SentenceSplitter(tokenizer=tiktoken.get_encoding("gpt2").encode) chunks = splitter.split_text(text) assert len(chunks) == 2