MMLU (Massive Multitask Language Understanding) — LLM Evaluation Dataset

The gold-standard multi-subject benchmark: roughly 16,000 four-option multiple-choice questions spanning 57 subjects, from STEM and the humanities to law, medicine, and the social sciences. Virtually every major LLM reports an MMLU score. A model scoring above 80% (5-shot) is considered strong; GPT-4 scores ~86% and Llama 3 70B ~82%.

Dataset Details

Provider: cais
Category: evaluation
Size: 16k questions
License: MIT
Downloads: 12M
Tags: Benchmark, Multiple-Choice, 57-Subjects, STEM, Reasoning
from datasets import load_dataset

# "cais/mmlu" requires a config name; "all" loads every subject at once
# (individual subject names such as "anatomy" also work as configs).
ds = load_dataset("cais/mmlu", "all")
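
As a sketch of how the records are typically consumed: the field names below match the cais/mmlu schema, but the A-D prompt template is just one common convention, not an official evaluation harness.

LETTERS = "ABCD"

# Each record has: question (str), subject (str),
# choices (list of four str), answer (int index into choices).
def make_prompt(row):
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"]))
    return f"{row['question']}\n{options}\nAnswer:"

example = ds["test"][0]
print(make_prompt(example))
print("Gold:", LETTERS[example["answer"]])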
