LIMA: Less Is More for Alignment — LLM Instruction Dataset

Landmark alignment research showing that a model fine-tuned on just 1,000 carefully curated examples can rival GPT-4 in instruction-following quality, challenging the "more data is always better" assumption. LIMA, a 65B-parameter model trained on only these 1,000 examples, was competitive with GPT-4 in human preference evaluations.

Dataset Details

Provider: Meta / GAIR
Category: instruction
Size: 1k rows
License: CC-BY-NC-SA 4.0
Downloads: 300k
Tags: Curated, Quality-over-Quantity, Research, Alignment
from datasets import load_dataset

# The dataset is hosted on the Hugging Face Hub under the GAIR organization
ds = load_dataset("GAIR/lima")
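Once loaded, each row pairs a curated prompt with its response as a list of conversation turns. The sketch below shows one way to flatten such a row into a (prompt, response) pair for supervised fine-tuning; the `conversations` field name and the alternating user/assistant turn order are assumptions about the schema, so verify them against the actual rows before use.

```python
# Minimal sketch: flatten a LIMA-style record into a (prompt, response) pair.
# The "conversations" field name and the user-then-assistant turn order are
# assumptions about the dataset schema, not confirmed by this page.

def to_pair(record):
    turns = record["conversations"]
    prompt = turns[0]                               # first turn: user prompt (assumed)
    response = turns[1] if len(turns) > 1 else ""   # second turn: curated reply (assumed)
    return prompt, response

# Hypothetical record for illustration only.
example = {
    "conversations": [
        "How do I reverse a list in Python?",
        "Use slicing: my_list[::-1] returns a reversed copy.",
    ],
}
prompt, response = to_pair(example)
print(prompt)    # -> How do I reverse a list in Python?
print(response)  # -> Use slicing: my_list[::-1] returns a reversed copy.
```

A pair like this can then be fed to any standard instruction-tuning pipeline (e.g. formatted into a chat template before tokenization).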
