Qwen 3.5: When Text, Vision, and Video Understand Each Other

Alibaba's Qwen 3.5 (397B) is the first truly native multimodal open-weight model — not a language model with a vision plugin, but a single architecture that thinks in text, images, and video simultaneously.

Most "multimodal" AI models are a lie, or at least an overstatement. In practice, they are a language model with an image encoder attached. The image encoder (typically a CLIP variant) converts visual inputs into token embeddings that the language model then processes, so image understanding happens in a separate module that was designed and trained independently of the language reasoning. This architecture works, but it has systematic weaknesses: the vision encoder and the language model were trained on different data with different objectives. They're aligned through fine-tuning, but t…
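To make the bolted-on design concrete, here is a minimal toy sketch of the data flow described above: a separate image encoder produces patch features, those features are mapped into the language model's embedding space, and the result is concatenated with the text token embeddings. Everything in it is an illustrative assumption rather than any real model's code. The dimensions, the function names (encode_image, project, embed_text, build_input_sequence), and the random weights are hypothetical, and the linear projection layer is one common way such systems bridge the two modules, not something the article specifies.

```python
import random

# All sizes below are hypothetical, chosen only to keep the example small.
EMBED_DIM = 8    # embedding width of the language model
VISION_DIM = 4   # feature width of the separately trained image encoder

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def encode_image(num_patches):
    """Stand-in for an independent image encoder (e.g. a CLIP variant).

    Returns one random feature vector per image patch.
    """
    return [[rng.uniform(-1, 1) for _ in range(VISION_DIM)]
            for _ in range(num_patches)]


# Assumed bridge between the modules: a VISION_DIM x EMBED_DIM matrix.
# In a real system this is the part tuned during alignment; here it is random.
PROJECTION = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
              for _ in range(VISION_DIM)]


def project(features):
    """Map vision features into the language model's embedding space."""
    return [[sum(f[i] * PROJECTION[i][j] for i in range(VISION_DIM))
             for j in range(EMBED_DIM)]
            for f in features]


def embed_text(tokens):
    """Stand-in for the language model's token embedding lookup."""
    return [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in tokens]


def build_input_sequence(num_patches, tokens):
    """Concatenate projected image tokens with text token embeddings."""
    image_tokens = project(encode_image(num_patches))
    text_tokens = embed_text(tokens)
    return image_tokens + text_tokens


sequence = build_input_sequence(3, ["what", "is", "this", "?"])
print(len(sequence), len(sequence[0]))  # 7 8
```

The point of the sketch is the seam: the language model only ever sees the output of project, so whatever the image encoder failed to capture is unavailable to the reasoning that follows.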
