Projects with this topic
https://github.com/modelscope/data-juicer Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
https://github.com/HumanSignal/label-studio Label Studio is a multi-type data labeling and annotation tool with a standardized output format
https://github.com/joanrod/ocr-vqgan OCR-VQGAN, a discrete image encoder (tokenizer and detokenizer) for figure images in the Paper2Fig100k dataset. Implementation of OCR Perceptual loss for clear text-within-im…
https://github.com/MultiTonic/thinking-dataset Creating a thinking dataset
https://github.com/coqui-ai/TTS-recipes 🐸 TTS recipes for different datasets
https://github.com/lm-sys/llm-decontaminator Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
https://github.com/Farama-Foundation/Minari A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
https://github.com/run-llama/llama-datasets GitHub repo for storing LlamaDatasets