HumanEval — LLM Evaluation Dataset

OpenAI's hand-crafted Python coding benchmark: 164 programming problems, each with a function signature, docstring, and unit tests. Models are scored with pass@k — the probability that at least one of k sampled completions passes all of a problem's unit tests. It remains one of the most widely used benchmarks for code-generation ability.
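The HumanEval paper estimates pass@k without bias by drawing n ≥ k samples per problem, counting the c that pass, and computing 1 − C(n−c, k)/C(n, k). A minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), where n is the number of sampled
    completions and c the number that passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 of 10 samples correct -> pass@1 = 0.2
```

The per-problem scores are then averaged over all 164 problems to get the reported pass@k.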

Dataset Details

Provider: OpenAI
Category: Evaluation
Size: 164 problems
License: MIT
Downloads: 5M
Tags: Benchmark, Python, Code-Correctness, Unit-Tests
```python
from datasets import load_dataset

ds = load_dataset("openai/openai_humaneval")  # single "test" split of 164 problems
```
