GPU Benchmark Matrix

Estimated tokens/sec for all 139 runnable local LLM variants in our library across all 55 GPUs. Estimates come from a memory-bandwidth roofline model with an efficiency factor of 1.0 (that is, assuming peak memory bandwidth is fully used), and community-measured benchmarks are listed alongside them. Top models on an NVIDIA RTX 4090 (24 GB):

BitNet b1.58 (3B, 1.58-bit)	~400 tokens/sec (est.)
Llama 4 (17B active / 109B total, Q4_K_M)	~400 tokens/sec (est.)
Gemma 3 (1B, Q4_K_M)	~400 tokens/sec (est.)
Gemma 4 (4B active / 26B total, MoE, Q4_K_M)	~400 tokens/sec (est.)
Llama 3.2 Family (1B, Q4_K_M)	~400 tokens/sec (est.)
Llama 3.2 Family (3B, Q4_K_M)	~400 tokens/sec (est.)
Qwen 3 (3B active / 30B total, Q4_K_M)	~400 tokens/sec (est.)
Phi-4 Mini (3.8B, Q4_K_M)	~400 tokens/sec (est.)
StarCoder 2 (3B, Q4_K_M)	~400 tokens/sec (est.)
SmolLM2 (1.7B, Q4_K_M)	~400 tokens/sec (est.)
SmolLM2 (360M, Q4_K_M)	~400 tokens/sec (est.)
Falcon 3 (3B, Q4_K_M)	~400 tokens/sec (est.)
Qwen 3.5 (Legacy Listing, Unverified) (10B active / 122B total, Q4_K_M)	~400 tokens/sec (est.)
Qwen3-Coder (3B active / 80B total, Q4_K_M)	~400 tokens/sec (est.)
Aya 3B (Tiny Aya) (3B, Q4_K_M)	~400 tokens/sec (est.)
EXAONE 3.5 (2.4B, Q4_K_M)	~400 tokens/sec (est.)
Gemma 3n (2B, Q4_K_M)	~400 tokens/sec (est.)
Ministral (3B, Q4_K_M)	~400 tokens/sec (est.)
Qwen 3.5 (0.8B, Q4_K_M)	~400 tokens/sec (est.)
Cogito v1 (3B, Q4_K_M)	~400 tokens/sec (est.)

Rows labeled "measured · community" are real user submissions that have been reviewed for plausibility; they are community-reported, not lab-verified. Every other row is an estimate derived from a documented formula. See the full matrix for every GPU, VRAM-fit status, and the methodology. Ran a model yourself? Submit your tokens/sec on the benchmarks page. The open community dataset is published under CC BY 4.0 at /measured-benchmarks.json.
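
For readers unfamiliar with the term, a memory-bandwidth roofline estimate can be sketched as follows. This is a generic illustration, not the site's documented formula, which may include extra terms or caps, so its output will not necessarily match the table above. The function name, its parameters, and the 4.5 bits-per-weight figure are assumptions made for the example; the only hardware number used is the RTX 4090's 1008 GB/s peak memory bandwidth.

```python
def estimate_tokens_per_sec(bandwidth_gb_s, active_params_b, bits_per_weight,
                            efficiency=1.0):
    """Generic roofline sketch (hypothetical helper, not the site's formula).

    Assumes decoding is memory-bound: every generated token requires reading
    each active weight from GPU memory once, so
        tokens/sec ~= efficiency * bandwidth / bytes_read_per_token.
    """
    if bandwidth_gb_s <= 0 or active_params_b <= 0 or bits_per_weight <= 0:
        raise ValueError("bandwidth, parameter count and bit width must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    # Billions of parameters * bytes per parameter = gigabytes read per token.
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Example: a hypothetical dense 8B model at an assumed 4.5 bits per weight
# on an RTX 4090 (1008 GB/s peak memory bandwidth).
example = estimate_tokens_per_sec(1008.0, 8.0, 4.5)
print(f"~{example:.0f} tokens/sec (est.)")
```

For mixture-of-experts models, the sketch takes the active parameter count rather than the total, since only the active weights are read per token under this assumption. Lowering `efficiency` below 1.0 scales the estimate down proportionally.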