FineWeb: LLM Pretraining Dataset

FineWeb is a massive, cleaned, and deduplicated English web dataset derived from Common Crawl, widely regarded as a gold standard for open pretraining data. At roughly 15 trillion tokens, it is comparable in scale to the corpora used to train leading open models such as Llama 3.

Dataset Details

Provider: HuggingFaceFW
Category: Pretraining
Size: 15 TB
License: ODC-By 1.0
Downloads: 2.5M
Tags: Web, English, Cleaned, CommonCrawl
```python
from datasets import load_dataset

# Stream the dataset rather than downloading the full 15 TB corpus
# up front; records are fetched lazily as you iterate.
ds = load_dataset("HuggingFaceFW/fineweb", streaming=True)
```
