GLM-5 / GLM-5.1 — Local AI Model by Zhipu AI

Written by Jakub Rusinowski · Last updated June 15, 2026

Zhipu AI's fifth-generation model series, released under MIT license for maximum flexibility. GLM-5 and GLM-5.1 excel at agentic workflows, tool calling, and long-context reasoning. The MIT license makes them one of the most commercially permissive frontier-class models available, with zero restrictions on commercial use or distribution.

Hardware Requirements

GLM-5 9B	Min 6 GB VRAM · Q4_K_M · 128,000 ctx · `ollama run hf.co/THUDM/GLM-5-9B-Chat-Q4_K_M`
GLM-5 32B	Min 20 GB VRAM · Q4_K_M · 128,000 ctx · `ollama run hf.co/THUDM/GLM-5-32B-Chat-Q4_K_M`
GLM-5.1 72B	Min 40 GB VRAM · Q4_K_M · 128,000 ctx · `ollama run hf.co/THUDM/GLM-5.1-72B-Chat-Q4_K_M`

How to Run Locally

Install Ollama then run: ollama run hf.co/THUDM/GLM-5-9B-Chat-Q4_K_M

Minimum VRAM: 6 GB. For best results use Q4_K_M quantization.

GLM-5 / GLM-5.1 — Frequently Asked Questions

How much VRAM does GLM-5 / GLM-5.1 need?

GLM-5 / GLM-5.1 needs about 6 GB VRAM at Q4_K_M quantization for its smallest variant. Variants: GLM-5 9B (6 GB, Q4_K_M); GLM-5 32B (20 GB, Q4_K_M); GLM-5.1 72B (40 GB, Q4_K_M). On Apple Silicon, unified memory counts toward this requirement.

Can I run GLM-5 / GLM-5.1 on an RTX 4090 (24 GB)?

Yes — GLM-5 / GLM-5.1 runs on an RTX 4090 (24 GB) and other 24 GB cards such as the RTX 3090. Smaller variants also fit comfortably on 8–16 GB GPUs at Q4_K_M.

What quantization should I use for GLM-5 / GLM-5.1?

Q4_K_M is the best balance of quality and VRAM for GLM-5 / GLM-5.1 in most cases. Choose Q8_0 for near-lossless quality if you have spare VRAM, or smaller quants (Q3/Q2) only when memory is tight.

How do I run GLM-5 / GLM-5.1 with Ollama?

Install Ollama, then run: ollama run hf.co/THUDM/GLM-5-9B-Chat-Q4_K_M. This downloads GLM-5 / GLM-5.1 and starts a local, OpenAI-compatible endpoint — no internet connection is needed after the initial download.

Can I Run GLM-5 / GLM-5.1 on My GPU?

GLM-5 / GLM-5.1 on Apple M2 Ultra