SlimPajama: LLM Pretraining Dataset

A heavily deduplicated and cleaned version of RedPajama. It removes low-quality and repetitive data to improve training efficiency.
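
To make the idea of "deduplicated and cleaned" concrete, here is a minimal toy sketch of near-duplicate filtering. It is not the pipeline used to build SlimPajama. It compares word shingles with exact Jaccard similarity, which only scales to small corpora, whereas production pipelines typically rely on approximate methods such as MinHash with locality-sensitive hashing. The function names, the `min_chars` cutoff, and the `threshold` value are all illustrative assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_corpus(docs, min_chars=20, threshold=0.8, k=3):
    """Drop very short documents and near-duplicates of earlier documents.

    min_chars and threshold are hypothetical values chosen for this toy
    example, not settings taken from any real pipeline.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        if len(doc) < min_chars:
            continue  # too short to be useful training text
        s = shingles(doc, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

The greedy loop keeps the first copy of each near-duplicate cluster and compares every new document against everything kept so far, so its cost grows quadratically with corpus size. That cost is the reason large-scale pipelines use hashing-based approximations.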

Dataset Details

Provider: Cerebras
Category: Pretraining
Size: 627B tokens
License: Apache 2.0
Downloads: 150k
Tags: English, Deduplicated, CommonCrawl

Usage

# Requires the Hugging Face `datasets` package (pip install datasets).
from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
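
Because the corpus is 627B tokens, downloading it in full before use is often impractical. The sketch below shows one way to stream it and preview a few records. It assumes the Hub repository ID `cerebras/SlimPajama-627B`, a `train` split, and a `text` field on each record, so adjust these if your copy differs. The helper names `stream_slimpajama` and `preview` are hypothetical.

```python
from itertools import islice


def stream_slimpajama(repo_id="cerebras/SlimPajama-627B", split="train"):
    """Open the dataset lazily; records are fetched only as they are iterated.

    repo_id and split are assumptions; substitute the values that apply to you.
    """
    # Imported here so `preview` below works even without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=True)


def preview(records, n=3, width=80):
    """Return the first `width` characters of the first `n` records' text."""
    return [rec["text"][:width] for rec in islice(records, n)]
```

A typical call would be `preview(stream_slimpajama())`. `preview` accepts any iterable of dictionaries with a `text` key, so it can be tried on a small in-memory list before touching the network.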
