Magicoder-OSS-Instruct-75K — LLM code Dataset

Uses real open-source code snippets as seeds to generate diverse coding problems and solutions with GPT-4. The OSS-Instruct method produces more realistic, grounded coding tasks than purely synthetic approaches. Magicoder-S-DS-6.7B trained on this data surpassed GPT-3.5-Turbo on HumanEval.

Dataset Details

Providerise-uiuc
Categorycode
Size75k Rows
LicenseMIT
Downloads250k
TagsCode-Generation, GPT-4, OSS-Grounded, Python
from datasets import load_dataset
ds = load_dataset("ise-uiuc/magicoder-oss-instruct")

← All Datasets | Fine-Tuning Guide