Written by Jakub Rusinowski · Last updated June 15, 2026
Google's mobile-first multimodal model family. Uses the novel MatFormer nested architecture — a single model file contains multiple sub-models (E2B/E4B) that can run at different sizes. Processes text, images, audio, and video. Runs on phones without internet.
| Gemma 3n E2B | Min 2 GB VRAM · Q4_K_M · 32,768 ctx · ollama run gemma3n:e2b |
| Gemma 3n E4B | Min 3 GB VRAM · Q4_K_M · 32,768 ctx · ollama run gemma3n:e4b |
Install Ollama then run: ollama run gemma3n:e2b
Minimum VRAM: 2 GB. For best results use Q4_K_M quantization.
Gemma 3n needs about 2 GB VRAM at Q4_K_M quantization for its smallest variant. Variants: Gemma 3n E2B (2 GB, Q4_K_M); Gemma 3n E4B (3 GB, Q4_K_M). On Apple Silicon, unified memory counts toward this requirement.
Yes — Gemma 3n runs on an RTX 4090 (24 GB) and other 24 GB cards such as the RTX 3090. Smaller variants also fit comfortably on 8–16 GB GPUs at Q4_K_M.
Q4_K_M is the best balance of quality and VRAM for Gemma 3n in most cases. Choose Q8_0 for near-lossless quality if you have spare VRAM, or smaller quants (Q3/Q2) only when memory is tight.
Install Ollama, then run: ollama run gemma3n:e2b. This downloads Gemma 3n and starts a local, OpenAI-compatible endpoint — no internet connection is needed after the initial download.