Projects with this topic
- https://github.com/tensorzero/llmgym - LLM Gym is a unified environment interface for developing and benchmarking LLM applications that learn from feedback. Think gym for LLM agents.
- https://github.com/princeton-nlp/SWE-bench - [ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues? Website: https://www.swebench.com/ (see the dataset-loading sketch after this list)
- https://github.com/princeton-nlp/CharXiv - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
- https://github.com/gin-gonic/FrameworkBenchmarks - Source code for the framework benchmarking project
- Repository for AI Model Benchmarking
- https://github.com/THUDM/LongBench - [ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (see the loading sketch after this list)
- https://github.com/THUDM/LongCite - LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
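
For the SWE-bench entry above, a minimal loading sketch, assuming the public princeton-nlp/SWE-bench dataset on the Hugging Face Hub and the field names listed on its dataset card (instance_id, repo, problem_statement); the evaluation harness itself lives in the GitHub repo:

    from datasets import load_dataset

    # Sketch: pull SWE-bench task instances from the Hugging Face Hub.
    # Field names follow the dataset card; assumption, not verified here.
    swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

    for task in swebench.select(range(3)):
        print(task["instance_id"], task["repo"])
        print(task["problem_statement"][:200])

Each instance pairs a real GitHub issue (problem_statement) with the repository and commit context needed to reproduce and test a fix.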
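
And for LongBench, a similar sketch, assuming the THUDM/LongBench hub dataset with per-task configs (e.g. "hotpotqa") and the context/input/answers fields from its dataset card; the dataset ships a loading script, so recent versions of the datasets library may need trust_remote_code=True:

    from datasets import load_dataset

    # Sketch: load one LongBench task split. The config name and field
    # names are assumptions taken from the dataset card.
    longbench = load_dataset("THUDM/LongBench", "hotpotqa",
                             split="test", trust_remote_code=True)

    sample = longbench[0]
    print(len(sample["context"]), "characters of context")
    print("Q:", sample["input"])
    print("Gold:", sample["answers"])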