SlimPajama — A heavily deduplicated and cleaned version of RedPajama. It removes low-quality and repetitive data … (Apache 2.0)
Databricks Dolly 15k — 15,000 high-quality human-generated prompt/response pairs. Specifically designed for commercial use … (CC-BY-SA 3.0)
Stanford Alpaca — The foundational self-instruct dataset that launched a thousand fine-tunes. 52k instruction-followin… (CC-BY-NC 4.0)
ShareGPT 52K — Real multi-turn conversations scraped from ShareGPT.com — a site where users shared their best ChatG… (CC-BY-NC 4.0)
WizardLM Evol Instruct 70k — Uses the 'Evol-Instruct' method to progressively rewrite simple instructions into increasingly compl… (Apache 2.0)
LIMA: Less Is More for Alignment — Landmark alignment research showing that a model fine-tuned on just 1,000 carefully curated examples can rival GPT-4 in instru… (CC-BY-NC-SA 4.0)
SmolTalk — A high-quality 2024 synthetic dataset created to train the SmolLM2 family of small language models. … (Apache 2.0)
The Stack v2 — The largest open code dataset ever assembled: 900 billion tokens across 600+ programming languages s… (Various (per-file))
Magicoder-OSS-Instruct-75K — Uses real open-source code snippets as seeds to generate diverse coding problems and solutions with … (MIT)
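The 'Evol-Instruct' method behind the WizardLM dataset above can be sketched as a prompt-rewriting loop: wrap an existing instruction in a meta-prompt that asks an LLM to produce a harder variant, then repeat on the output. This is a minimal sketch; the evolution rules, prompt wording, and helper names below are illustrative assumptions, not the paper's exact templates:

```python
import random

# Illustrative evolution strategies loosely modeled on Evol-Instruct's
# "in-depth" operations; exact wording here is an assumption.
IN_DEPTH_OPS = [
    "Add one more constraint or requirement to the instruction.",
    "Replace general concepts in the instruction with more specific ones.",
    "Require multiple reasoning steps instead of a single step.",
]

def build_evolve_prompt(instruction: str, op: str) -> str:
    """Wrap an existing instruction in a meta-prompt asking an LLM
    to rewrite it into a more complex variant."""
    return (
        "Rewrite the following instruction into a more complex version.\n"
        f"Evolution rule: {op}\n"
        f"#Instruction#: {instruction}\n"
        "#Rewritten Instruction#:"
    )

def evolve(instruction: str, rounds: int, llm=None, rng=None) -> list:
    """Progressively evolve one seed instruction for `rounds` generations.
    `llm` is any callable prompt -> str; a trivial stub is used here so the
    sketch runs without a model (it just echoes the instruction back)."""
    rng = rng or random.Random(0)
    llm = llm or (lambda p: p.split("#Instruction#: ")[1].split("\n")[0])
    lineage = [instruction]
    for _ in range(rounds):
        op = rng.choice(IN_DEPTH_OPS)
        prompt = build_evolve_prompt(lineage[-1], op)
        lineage.append(llm(prompt).strip())
    return lineage
```

In the real pipeline the stub `llm` would be replaced by a call to a strong instruction-following model, and low-quality evolutions would be filtered before the pairs enter the dataset.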
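The OSS-Instruct recipe behind Magicoder above works the other way around: instead of evolving instructions, it seeds a generator prompt with a real open-source code snippet and asks an LLM to invent a problem inspired by it. A minimal sketch of the seeding step, with prompt text that approximates rather than reproduces Magicoder's actual template:

```python
def build_oss_instruct_prompt(seed_snippet: str) -> str:
    """Turn a real open-source code snippet into a meta-prompt asking an
    LLM to create a self-contained coding problem and solution inspired
    by it. Wording is an illustrative assumption, not Magicoder's exact text."""
    return (
        "Please gain inspiration from the following code snippet to create "
        "a high-quality programming problem and a correct solution.\n"
        "Code snippet for inspiration:\n"
        + seed_snippet + "\n"
        "Problem:\n"
    )

# Example: seed the generator with a small real-world-style snippet.
seed = (
    "def rolling_mean(xs, k):\n"
    "    return [sum(xs[i:i+k]) / k for i in range(len(xs) - k + 1)]"
)
prompt = build_oss_instruct_prompt(seed)
```

Because the seeds are drawn from real repositories rather than written by the model, the generated problems inherit the diversity of actual open-source code.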