MT-Bench: LLM Evaluation Dataset

Multi-turn benchmark of 80 challenging questions (10 per category) across 8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Each question has two conversation turns, and GPT-4 is used as a judge to score responses, enabling automated preference evaluation. Introduced by LMSYS alongside Chatbot Arena, whose leaderboard has reported MT-Bench scores.
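To make the judging step concrete, here is a minimal sketch of single-answer grading in the MT-Bench style: the judge model is asked to rate a response from 1 to 10 and emit the score as "Rating: [[N]]", which is then parsed out. `build_judge_prompt` is a hypothetical helper; the official judge template differs in exact wording.

```python
import re
from typing import Optional

def build_judge_prompt(question: str, answer: str) -> str:
    # Hypothetical prompt template in the spirit of MT-Bench's
    # single-answer grading; not the official wording.
    return (
        "Please act as an impartial judge and evaluate the quality of the "
        "response provided by an AI assistant to the user question below. "
        "Rate the response on a scale of 1 to 10 and output your score "
        'strictly in the format "Rating: [[N]]".\n\n'
        f"[Question]\n{question}\n\n[Answer]\n{answer}\n"
    )

def parse_rating(judgment: str) -> Optional[int]:
    """Extract the numeric score from a judge's verdict, if present."""
    m = re.search(r"Rating:\s*\[\[(\d+)\]\]", judgment)
    return int(m.group(1)) if m else None

print(parse_rating("The answer is thorough and correct. Rating: [[8]]"))  # → 8
```

In practice the prompt is sent to the judge model (e.g. via the OpenAI API) and `parse_rating` is applied to its reply; turn-2 grading additionally includes the first turn's question and answer in the prompt.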

Dataset Details

Provider: lmsys
Category: evaluation
Size: 80 questions
License: Apache 2.0
Downloads: 900k
Tags: Benchmark, Multi-Turn, LLM-Judge, GPT-4, Chatbot-Arena

# Load MT-Bench with the Hugging Face datasets library
from datasets import load_dataset
ds = load_dataset("lmsys/mt-bench")
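Each benchmark question is a small record holding its category and a list of two conversation turns (this is the schema used by FastChat's released `question.jsonl`; the records below are illustrative stand-ins, not actual benchmark questions). A quick sketch of working with that shape:

```python
from collections import Counter

# Illustrative stand-in records following the assumed MT-Bench schema:
# "question_id", "category", and a two-element "turns" list.
questions = [
    {"question_id": 81, "category": "writing",
     "turns": ["Compose an engaging travel blog post...",
               "Rewrite your post as a short poem."]},
    {"question_id": 101, "category": "reasoning",
     "turns": ["If A finishes the race before B...",
               "What changes if the order is reversed?"]},
]

# Count prompts per category, as one would over the full benchmark
# (the real set has 10 questions in each of the 8 categories).
per_category = Counter(q["category"] for q in questions)
print(per_category)
```

A model under evaluation answers turn 1, then receives turn 2 in the same conversation, and both answers are scored by the judge.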
