Written by Jakub Rusinowski · Last updated June 15, 2026
Zhipu AI's fifth-generation model series, released under MIT license for maximum flexibility. GLM-5 and GLM-5.1 excel at agentic workflows, tool calling, and long-context reasoning. The MIT license makes them one of the most commercially permissive frontier-class models available, with zero restrictions on commercial use or distribution.
| GLM-5 9B | Min 6 GB VRAM · Q4_K_M · 128,000 ctx · ollama run hf.co/THUDM/GLM-5-9B-Chat-Q4_K_M |
| GLM-5 32B | Min 20 GB VRAM · Q4_K_M · 128,000 ctx · ollama run hf.co/THUDM/GLM-5-32B-Chat-Q4_K_M |
| GLM-5.1 72B | Min 40 GB VRAM · Q4_K_M · 128,000 ctx · ollama run hf.co/THUDM/GLM-5.1-72B-Chat-Q4_K_M |
Install Ollama then run: ollama run hf.co/THUDM/GLM-5-9B-Chat-Q4_K_M
Minimum VRAM: 6 GB. For best results use Q4_K_M quantization.
GLM-5 / GLM-5.1 needs about 6 GB VRAM at Q4_K_M quantization for its smallest variant. Variants: GLM-5 9B (6 GB, Q4_K_M); GLM-5 32B (20 GB, Q4_K_M); GLM-5.1 72B (40 GB, Q4_K_M). On Apple Silicon, unified memory counts toward this requirement.
Yes — GLM-5 / GLM-5.1 runs on an RTX 4090 (24 GB) and other 24 GB cards such as the RTX 3090. Smaller variants also fit comfortably on 8–16 GB GPUs at Q4_K_M.
Q4_K_M is the best balance of quality and VRAM for GLM-5 / GLM-5.1 in most cases. Choose Q8_0 for near-lossless quality if you have spare VRAM, or smaller quants (Q3/Q2) only when memory is tight.
Install Ollama, then run: ollama run hf.co/THUDM/GLM-5-9B-Chat-Q4_K_M. This downloads GLM-5 / GLM-5.1 and starts a local, OpenAI-compatible endpoint — no internet connection is needed after the initial download.