LLM Glossary

Plain-English definitions for all the technical terms you'll encounter when running AI locally.

- **VRAM** — Video RAM, dedicated GPU memory. Its size determines which models you can run.
- **Quantization** — compressing model weights from 16-bit down to 8-bit or 4-bit. Reduces VRAM use by 2–4×.
- **GGUF** — the standard model file format for llama.cpp, Ollama, and LM Studio.
- **Context window** — the maximum number of tokens a model can process at once. Ranges from 2K to 1M depending on the model.
- **LoRA** — Low-Rank Adaptation, an efficient fine-tuning method that trains only a small set of added weights.
- **RAG** — Retrieval-Augmented Generation, grounding an LLM's answers in external documents retrieved at query time.
- **Tokens/sec** — LLM inference speed. 10+ t/s is usable; 30+ t/s feels fast.
- **Ollama** — a free, open-source CLI tool for downloading and running LLMs locally.
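As a rough worked example tying the VRAM and quantization entries together, here is a minimal sketch of how to ballpark the memory a model needs at a given bit width. The `overhead` factor is an assumption standing in for KV cache and runtime buffers, not a measured constant; actual usage varies with context length and runtime.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM (in GB) needed to load a model's weights.

    overhead=1.2 is an assumed fudge factor for KV cache and
    runtime buffers; real usage depends on context length.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model: 16-bit weights need 4x the VRAM of 4-bit weights,
# which is where the "reduces VRAM by 2-4x" figure comes from.
print(estimate_vram_gb(7, 16))  # ≈ 16.8 GB
print(estimate_vram_gb(7, 4))   # ≈ 4.2 GB
```

By this estimate, a 7B model that would not fit on a 12 GB GPU at 16-bit fits comfortably once quantized to 4-bit.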