SmolTalk: LLM Instruction Dataset

A high-quality 2025 synthetic dataset created to train the SmolLM2 family of small language models. It contains 1 million examples covering diverse tasks, generated with careful quality filtering, and demonstrates that small models (1.7B, 360M) can achieve strong performance with the right training data.

Dataset Details

Provider: HuggingFaceTB
Category: instruction
Size: 1M rows
License: Apache 2.0
Downloads: 600k
Tags: Synthetic, 2025, Small-Models, High-Quality
from datasets import load_dataset

# Download and load the full dataset from the Hugging Face Hub
ds = load_dataset("HuggingFaceTB/smoltalk")
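Once loaded, each row can be processed as a chat-style record. The sketch below assumes rows store a "messages" list of {"role", "content"} dicts, a common convention for instruction datasets (check the dataset card for the exact schema); the `to_prompt_response` helper and the sample record are illustrative, not part of the dataset's API.

```python
def to_prompt_response(messages):
    """Split a single-turn conversation into a (prompt, response) pair.

    Assumes the chat-format convention of {"role", "content"} dicts;
    verify against the actual dataset schema before relying on this.
    """
    prompt = next(m["content"] for m in messages if m["role"] == "user")
    response = next(m["content"] for m in messages if m["role"] == "assistant")
    return prompt, response


# Illustrative record in the assumed chat format
example = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}

prompt, response = to_prompt_response(example["messages"])
print(prompt)    # -> What is 2 + 2?
print(response)  # -> 2 + 2 = 4.
```

In a fine-tuning pipeline, a function like this would typically be mapped over every row to build the model's input/target pairs.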
