Cosmopedia — LLM Pretraining Dataset

A dataset of synthetic textbooks, blog posts, and stories generated by Mixtral-8x7B, designed to convey high-quality educational content.

Dataset Details

Provider: HuggingFaceFW
Category: pretraining
Size: 25B tokens
License: Apache 2.0
Downloads: 400k
Tags: Synthetic, Education, Textbook
# Load the dataset with the Hugging Face datasets library
from datasets import load_dataset
ds = load_dataset("HuggingFaceFW/cosmopedia")
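At 25B tokens, downloading the full dataset up front can be impractical, so a streaming load may be preferable. The sketch below is a minimal example, assuming the repository id shown above and a "train" split; streaming=True is the standard lazy-loading option of the datasets library.

from datasets import load_dataset

# Stream examples lazily from the Hub instead of downloading everything
# (assumes the repository id above exposes a "train" split).
ds_stream = load_dataset("HuggingFaceFW/cosmopedia", split="train", streaming=True)

# Peek at the first record without materializing the whole split.
first_example = next(iter(ds_stream))
print(first_example.keys())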
