benchmark
Projects with this topic
https://github.com/princeton-nlp/SWE-bench [ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues?
🔗 https://www.swebench.com/
Repository for AI Model Benchmarking
https://github.com/princeton-nlp/CharXiv CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
https://github.com/THUDM/LongCite LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
https://github.com/THUDM/LongBench [ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
https://github.com/VictoriaMetrics/prometheus-benchmark Benchmark for Prometheus-compatible systems
https://github.com/VictoriaMetrics/billy Billy benchmarks for VictoriaMetrics