distributed-training
Projects with this topic
OpenLIT is an open-source LLM Observability tool built on OpenTelemetry.
📈 🔥 Monitor GPU performance, LLM traces with input and output metadata, and metrics like cost, tokens, and user interactions, along with complete APM for LLM apps. 🖥️
🔧 🔗 https://github.com/NVIDIA/apex
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
https://github.com/coqui-ai/snakepit
🐍 Coqui's machine learning job scheduler