作者: Jakub Rusinowski · 最后更新: 2026年6月15日
DeepSeek's April 24, 2026 preview release, MIT licensed with a 1M-token context window. V4-Pro (1.6T total / 49B active) is server-class only — not runnable on consumer hardware even heavily quantized. V4-Flash (284B total / 13B active) is workstation-tier but local support is still WIP: llama.cpp has no merged upstream support as of June 2026 (only experimental forks), and Ollama's listings appear to route to cloud-hosted inference rather than a true local download. DeepSeek R2 remains unreleased/rumored as of June 15, 2026 and is intentionally not listed.
| DeepSeek V4-Flash | Min 140 GB VRAM · Q4 (experimental) · 1,000,000 ctx · ollama run deepseek-v4-flash (cloud-hosted on Ollama; local = WIP llama.cpp forks only) |
| DeepSeek V4-Pro | Min 400 GB VRAM · Q2 (experimental, datacenter only) · 1,000,000 ctx · |
Install Ollama then run: ollama run deepseek-v4-flash (cloud-hosted on Ollama; local = WIP llama.cpp forks only)
Minimum VRAM: 140 GB. For best results use Q4_K_M quantization.
DeepSeek V4 needs about 140 GB VRAM at Q4 (experimental) quantization for its smallest variant. Variants: DeepSeek V4-Flash (140 GB, Q4 (experimental)); DeepSeek V4-Pro (400 GB, Q2 (experimental, datacenter only)). On Apple Silicon, unified memory counts toward this requirement.
DeepSeek V4's smallest variant needs about 140 GB, which exceeds a single RTX 4090 (24 GB). Use multiple GPUs, a higher-VRAM card, or Apple Silicon with large unified memory.
Q4_K_M is the best balance of quality and VRAM for DeepSeek V4 in most cases. Choose Q8_0 for near-lossless quality if you have spare VRAM, or smaller quants (Q3/Q2) only when memory is tight.
Install Ollama, then run: ollama run deepseek-v4-flash (cloud-hosted on Ollama; local = WIP llama.cpp forks only). This downloads DeepSeek V4 and starts a local, OpenAI-compatible endpoint — no internet connection is needed after the initial download.