The Stack v2: LLM Code Dataset

The Stack v2 is one of the largest open code datasets: roughly 900 billion tokens across 600+ programming languages, sourced from public GitHub repositories (via the Software Heritage archive), with an opt-out mechanism for developers who want their code removed. It was used to train StarCoder2 and is a common pretraining corpus for code LLMs. It includes permissively licensed repositories and has gone through both exact-deduplication and near-deduplication passes.
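To make the two deduplication passes concrete, here is a minimal sketch, not the BigCode pipeline itself. It assumes exact duplicates are detected by hashing file contents and near-duplicates by Jaccard similarity over token shingles; the function names, the 3-token shingle size, and the 0.7 threshold are illustrative assumptions.

```python
import hashlib
import re


def exact_key(text: str) -> str:
    """Hash of the raw file contents, used to drop byte-identical copies."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of n-token shingles; tokens are runs of word characters."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(files, threshold=0.7):
    """Keep the first file of every exact or near-duplicate group.

    threshold is an assumed value chosen for illustration only.
    """
    seen_hashes = set()
    kept = []  # list of (text, shingle_set) pairs for files kept so far
    for text in files:
        key = exact_key(text)
        if key in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(key)
        candidate = shingles(text)
        if any(jaccard(candidate, other) >= threshold for _, other in kept):
            continue  # near duplicate of a file already kept
        kept.append((text, candidate))
    return [text for text, _ in kept]
```

The pairwise comparison here is quadratic in the number of files, so it only suits small collections. Pipelines at this scale typically approximate the same Jaccard test with MinHash signatures and locality-sensitive hashing.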

Dataset Details

Provider: BigCode
Category: code
Size: 900B tokens
License: Various (per-file)
Downloads: 350k
Tags: Source-Code, 600+ Languages, GitHub, BigCode
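from datasets import load_dataset
# The dataset is gated: accept its terms on the Hugging Face Hub and log in first.
# streaming=True iterates over rows lazily instead of downloading the full corpus.
ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)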
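Because licensing is per-file and the corpus spans hundreds of languages, most users will want to narrow the rows before training. The sketch below filters any iterable of row dictionaries by language and license category. The field names `language` and `license_type` and the value `"permissive"` are assumptions about the row schema, so check them against the dataset card before relying on them.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_records(
    records: Iterable[Dict[str, Any]],
    language: str,
    license_type: str = "permissive",  # assumed label, verify on the dataset card
    language_key: str = "language",  # assumed field name
    license_key: str = "license_type",  # assumed field name
) -> Iterator[Dict[str, Any]]:
    """Lazily yield rows that match the requested language and license type."""
    for record in records:
        if record.get(language_key) != language:
            continue
        if record.get(license_key) != license_type:
            continue
        yield record


# Small in-memory stand-in for dataset rows, so the sketch runs offline.
sample_rows = [
    {"path": "a.py", "language": "Python", "license_type": "permissive"},
    {"path": "b.py", "language": "Python", "license_type": "no_license"},
    {"path": "c.rs", "language": "Rust", "license_type": "permissive"},
]
python_rows = list(filter_records(sample_rows, "Python"))
```

The function is a generator, so it can wrap a single streamed split of the dataset without materializing it in memory.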
