LLM Benchmarks
Detailed information about the benchmarks used to evaluate language models on our leaderboard.
MMLU
Multi-task language understanding benchmark focused on evaluating models' general knowledge and reasoning abilities across a wide range of academic subjects
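As an illustration of the task format, the sketch below builds a four-choice, MMLU-style prompt and scores the model's answer by exact match on the chosen letter. The question, the prompt template, and the commented-out `ask_model` call are hypothetical, not taken from the actual dataset or evaluation harness.

```python
# Illustrative MMLU-style evaluation sketch (hypothetical question and model call).
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_prompt(question: str, choices: list[str]) -> str:
    """Render a four-choice question as a single prompt string."""
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def score_item(predicted_letter: str, correct_letter: str) -> bool:
    """Exact match on the answer letter; accuracy is the mean over all items."""
    return predicted_letter.strip().upper() == correct_letter

# Example usage with a made-up item:
prompt = format_prompt(
    "Which data structure offers O(1) average-case lookup by key?",
    ["Linked list", "Hash table", "Binary search tree", "Stack"],
)
# prediction = ask_model(prompt)        # hypothetical model call
# print(score_item(prediction, "B"))
```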
MMLU Pro
Multi-task language understanding benchmark that extends MMLU with more challenging, reasoning-focused questions across academic subjects
MMMU
Multimodal understanding and reasoning benchmark aimed at expert-level general AI, covering disciplines such as art & design, business, science, health & medicine, humanities & social sciences, and technology & engineering
HellaSwag
Commonsense natural language inference benchmark focused on sentence completion, assessing models' ability to understand context and reason about everyday situations
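HellaSwag items pair a context with several candidate endings, and a common way to evaluate them is to pick the ending to which the model assigns the highest (often length-normalized) log-likelihood. The sketch below assumes a hypothetical `sequence_logprob(context, ending)` scoring function; it is not the official evaluation code.

```python
# Illustrative HellaSwag-style scoring sketch: choose the most likely ending.
# `sequence_logprob` is a hypothetical function returning the model's total
# log-probability of `ending` given `context`.

def pick_ending(context: str, endings: list[str], sequence_logprob) -> int:
    """Return the index of the ending with the best length-normalized log-likelihood."""
    scores = []
    for ending in endings:
        logprob = sequence_logprob(context, ending)
        scores.append(logprob / max(len(ending.split()), 1))  # crude length normalization
    return max(range(len(endings)), key=scores.__getitem__)

# An item counts as correct when the chosen index matches the labeled ending;
# accuracy is averaged over the dataset.
```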
HumanEval
Code generation benchmark focused on evaluating language models' ability to generate functionally correct Python code from function signatures and docstrings, verified against unit tests
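HumanEval problems present a Python function signature and docstring as the prompt, and a completion counts as correct only if it passes the problem's unit tests (the basis of the pass@k metric). The example below mimics that structure with a made-up problem; it is not an item from the real dataset.

```python
# Illustrative HumanEval-style problem (hypothetical, not from the dataset).

PROMPT = '''
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# A candidate completion produced by the model:
COMPLETION = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

def check(candidate) -> None:
    """Unit tests: the completion is functionally correct only if these pass."""
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []

namespace: dict = {}
exec(PROMPT + COMPLETION, namespace)   # assemble and run prompt + completion
check(namespace["running_max"])        # raises AssertionError if incorrect
```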
MATH
Mathematical word problem solving benchmark focused on evaluating models' mathematical reasoning and problem-solving abilities on competition-level problems
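MATH solutions conventionally mark the final answer inside \boxed{...}, so automatic grading typically extracts that answer from the model's solution and compares it to the reference. The sketch below shows the idea with a naive string comparison; real graders normalize LaTeX expressions more carefully, and the helper names here are hypothetical.

```python
import re
from typing import Optional

# Illustrative answer extraction for MATH-style grading (simplified sketch).

def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def is_correct(model_solution: str, reference_answer: str) -> bool:
    """Naive exact-match grading; real graders normalize expressions first."""
    answer = extract_boxed(model_solution)
    return answer is not None and answer == reference_answer.strip()

# Example with a made-up solution string:
print(is_correct(r"... so the total is \boxed{42}.", "42"))  # True
```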
MATH500
Mathematical reasoning benchmark, a 500-problem subset of MATH, focused on evaluating AI models' ability to solve high-school-level math problems requiring logical reasoning
GPQA
Graduate-level, "Google-proof" question answering benchmark of expert-written multiple-choice questions in STEM fields (biology, physics, and chemistry) that require deep understanding and reasoning
GPQA Diamond
A more challenging subset of GPQA, restricted to questions that domain experts verified with high confidence
IFEval
Instruction-following benchmark for large language models focused on verifiable natural language instructions, such as word-count, keyword, and formatting constraints, that can be checked programmatically
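Because IFEval's instructions are verifiable, compliance can be checked with simple rules rather than judged by another model. The sketch below checks two such constraints for a hypothetical response; it is not the official checker.

```python
# Illustrative IFEval-style verifiable-instruction checks (simplified sketch).

def has_min_words(response: str, minimum: int) -> bool:
    """Check an instruction like 'answer in at least N words'."""
    return len(response.split()) >= minimum

def mentions_keyword(response: str, keyword: str, times: int = 1) -> bool:
    """Check an instruction like 'mention the word X at least N times'."""
    return response.lower().count(keyword.lower()) >= times

response = "Benchmarks measure model quality. Benchmarks need careful design."
print(has_min_words(response, 5))                   # True
print(mentions_keyword(response, "benchmarks", 2))  # True
```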