Author: Jakub Rusinowski · Last updated: June 15, 2026
These are the strongest local models that fit entirely in 8 GB of VRAM, ranked by capability. Each entry lists the quantization level needed to fit, the resulting model size, and the estimated tokens/sec.
| Family | Model | Quantization | Size | Est. speed (NVIDIA GeForce RTX 5060) |
| --- | --- | --- | --- | --- |
| Granite 3.0 | Granite 3.0 8B Instruct | Q4_K_M | 5.5 GB | ~81 tok/s |
| Qwen 2.5 Family | Qwen 2.5 7B Instruct | Q4_K_M | 4.8 GB | ~93 tok/s |
| Qwen 3 | Qwen 3 8B | Q4_K_M | 5.5 GB | ~81 tok/s |
| DeepSeek R1 | DeepSeek R1 Distill Llama 8B | Q4_K_M | 5.8 GB | ~77 tok/s |
| Qwen3-Coder | Qwen3-Coder 8B | Q4_K_M | 5.5 GB | ~81 tok/s |
| Qwen 3.5 (Legacy Listing, Unverified) | Qwen 3.5 7B | Q4_K_M | 4.8 GB | ~93 tok/s |
| InternLM 3 | InternLM 3 8B Instruct | Q4_K_M | 5.5 GB | ~81 tok/s |
| Qwen 3.5 | Qwen 3.5 9B | Q4_K_M | 6.6 GB | ~68 tok/s |
| Yi 1.5 Family | Yi 1.5 9B Chat | Q4_K_M | 6.2 GB | ~72 tok/s |
| Falcon 3 | Falcon 3 10B Instruct | Q4_K_M | 6.5 GB | ~69 tok/s |
| GLM-4.7 / GLM-Z1 | GLM-4.7 9B | Q4_K_M | 6.2 GB | ~72 tok/s |
| GLM-5 / GLM-5.1 | GLM-5 9B | Q4_K_M | 6.0 GB | ~75 tok/s |
| Gemma 2 Family | Gemma 2 9B IT | Q4_K_M | 6.8 GB | ~66 tok/s |
| Llama 3.1 Family | Llama 3.1 8B Instruct | Q4_K_M | 6.5 GB | ~69 tok/s |
| IBM Granite 4.1 | Granite 4.1 8B | Q4_K_M | 5.0 GB | ~90 tok/s |
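
The speed estimates listed above look consistent with a simple memory-bandwidth bound, where each generated token requires reading the whole quantized model once, so tokens/sec is roughly bandwidth divided by model size. The sketch below assumes a memory bandwidth of 448 GB/s for the RTX 5060; that figure and the function name `estimate_tok_per_sec` are assumptions for illustration, not something the listing states, and real throughput will vary with context length, runtime, and drivers.

```python
# Rough decode-speed estimate under a memory-bandwidth bound.
# ASSUMPTION: 448 GB/s memory bandwidth for the RTX 5060 (not stated in the listing).
ASSUMED_BANDWIDTH_GB_S = 448.0


def estimate_tok_per_sec(model_size_gb: float,
                         bandwidth_gb_s: float = ASSUMED_BANDWIDTH_GB_S) -> int:
    """Approximate tokens/sec as bandwidth / model size, rounded to an integer."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return round(bandwidth_gb_s / model_size_gb)


if __name__ == "__main__":
    # Sizes taken from the listing (Q4_K_M).
    for name, size_gb in [
        ("Qwen 2.5 7B Instruct", 4.8),
        ("Granite 3.0 8B Instruct", 5.5),
        ("Gemma 2 9B IT", 6.8),
    ]:
        print(f"{name}: ~{estimate_tok_per_sec(size_gb)} tok/s")
```

Under this assumption the estimates for the 4.8 GB, 5.5 GB, and 6.8 GB entries come out to the same values the listing shows, which suggests the listed speeds are derived figures rather than measured benchmarks.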
The top five entries (Granite 3.0, Qwen 2.5 Family, Qwen 3, DeepSeek R1, and Qwen3-Coder) all fit in 8 GB of VRAM at Q4_K_M.
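
Whether a model "fits" depends on more than the weight file, since the runtime also needs memory for the KV cache and buffers. As a minimal sketch, the check below keeps only models whose listed size plus a reserve stays within the VRAM budget. The `reserve_gb` parameter and its 1.5 GB value are hypothetical; the listing does not say how much headroom it assumes.

```python
# Filter models by VRAM budget, leaving a reserve for KV cache and buffers.
# ASSUMPTION: reserve_gb is a hypothetical headroom value, not from the listing.
MODEL_SIZES_GB = {
    "Qwen 2.5 7B Instruct": 4.8,
    "Granite 3.0 8B Instruct": 5.5,
    "DeepSeek R1 Distill Llama 8B": 5.8,
    "Qwen 3.5 9B": 6.6,
    "Gemma 2 9B IT": 6.8,
}


def models_that_fit(sizes_gb: dict[str, float],
                    vram_gb: float = 8.0,
                    reserve_gb: float = 1.5) -> list[str]:
    """Return model names whose size plus the reserve fits in the VRAM budget."""
    return [name for name, size in sizes_gb.items()
            if size + reserve_gb <= vram_gb]


if __name__ == "__main__":
    for name in models_that_fit(MODEL_SIZES_GB):
        print(name)
```

With no reserve, every model in this sample fits in 8 GB; with the hypothetical 1.5 GB reserve, the two 9B entries drop out, which is why longer contexts may call for a smaller model or a lower quantization level.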
GPUs with 8 GB of VRAM: NVIDIA GeForce RTX 5060, NVIDIA GeForce RTX 4060, AMD Radeon RX 9060 XT 8GB, and NVIDIA GeForce RTX 5060 Ti 8GB.