Commit 11786caa, authored by Marcus Schiesser, committed by GitHub

docs: document duplicate strategies (#12457)

parent 16140579
@@ -192,15 +192,16 @@ def arun_transformations_wrapper(
 class DocstoreStrategy(str, Enum):
-    """Document de-duplication strategy.
+    """Document de-duplication strategies work by comparing the hashes or ids stored in the document store.
+    They require a document store to be set, and it must be persisted across pipeline runs.
     Attributes:
         UPSERTS:
-            ('upserts') Use upserts to handle duplicates.
+            ('upserts') Use upserts to handle duplicates. Checks if a document is already in the doc store based on its id. If it is not, or if the hash of the document has changed, it will update the document in the doc store and run the transformations.
         DUPLICATES_ONLY:
-            ('duplicates_only') Only handle duplicates.
+            ('duplicates_only') Only handle duplicates. Checks if the hash of a document is already in the doc store. Only if it is not will it add the document to the doc store and run the transformations.
         UPSERTS_AND_DELETE:
-            ('upserts_and_delete') Use upserts and delete to handle duplicates.
+            ('upserts_and_delete') Use upserts and delete to handle duplicates. Like the upserts strategy, but it will also delete from the doc store any documents that are no longer present in the input.
     """
     UPSERTS = "upserts"
......
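The three strategies documented above can be illustrated with a small standalone sketch. This is not the library's implementation: the function `select_docs_to_process`, the helper `content_hash`, and the use of a plain dict as the document store are all hypothetical simplifications, assuming each document is just an id mapped to its text.

```python
import hashlib
from enum import Enum


class DocstoreStrategy(str, Enum):
    UPSERTS = "upserts"
    DUPLICATES_ONLY = "duplicates_only"
    UPSERTS_AND_DELETE = "upserts_and_delete"


def content_hash(text: str) -> str:
    """Hypothetical helper: hash a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_docs_to_process(docs, store, strategy):
    """Return the ids of documents whose transformations should run.

    docs:  dict mapping doc id -> text (the current pipeline input)
    store: dict mapping doc id -> hash (stand-in for a persisted doc store,
           mutated in place)
    """
    to_run = []

    if strategy is DocstoreStrategy.DUPLICATES_ONLY:
        # Compare by hash only: skip any document whose hash is already known.
        seen = set(store.values())
        for doc_id, text in docs.items():
            h = content_hash(text)
            if h not in seen:
                store[doc_id] = h
                seen.add(h)
                to_run.append(doc_id)
        return to_run

    # UPSERTS and UPSERTS_AND_DELETE: compare by id, then by hash.
    for doc_id, text in docs.items():
        h = content_hash(text)
        if store.get(doc_id) != h:  # new id, or same id with changed content
            store[doc_id] = h
            to_run.append(doc_id)

    if strategy is DocstoreStrategy.UPSERTS_AND_DELETE:
        # Drop stored documents that are absent from the current input.
        for stale_id in set(store) - set(docs):
            del store[stale_id]

    return to_run


# The same store is reused across runs, mimicking persistence.
store = {}
first = select_docs_to_process(
    {"a": "hello", "b": "world"}, store, DocstoreStrategy.UPSERTS
)
second = select_docs_to_process(
    {"a": "hello", "b": "world!"}, store, DocstoreStrategy.UPSERTS
)
third = select_docs_to_process(
    {"b": "world!"}, store, DocstoreStrategy.UPSERTS_AND_DELETE
)
```

In this sketch the first run processes both documents, the second run processes only the changed document `b`, and the third run processes nothing but removes `a` from the store. Reusing one `store` dict across calls stands in for the requirement that the document store be persisted across pipeline runs.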