Qwen 3.5: When Text, Vision, and Video Understand Each Other

By Jakub Rusinowski · Last updated: July 10, 2026

Alibaba's Qwen 3.5 (397B) is the first truly native multimodal open-weight model: not a language model with a vision plugin, but a single architecture that thinks in text, images, and video simultaneously.

Native Multimodal Architecture
Benchmark Performance
The Practical Scale Problem
Real-World Multimodal Use Cases
The Apache 2.0 License
Why Native Multimodal Is the Future

Most "multimodal" AI models are a lie, or at least an overstatement. What they actually are is a language model with an image encoder attached. The image encoder (typically a CLIP variant) converts visual inputs into token embeddings that the language model then processes. Image understanding therefore happens in a separate module that was designed and trained independently of the language reasoning.

This architecture works, but it has systematic weaknesses. The vision encoder and the language model were trained on different data with different objectives. They're aligned through fine-tuning, but t…
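The bolted-on design described above can be illustrated with a toy pipeline. This is a minimal sketch, not how any particular model is implemented: the dimensions, the random "frozen" weights, and every function name here are hypothetical stand-ins chosen for illustration. It shows only the data flow, in which a separate vision module produces patch embeddings, a projection maps them into the language model's embedding space, and the result is prepended to the text token embeddings.

```python
import random

# All sizes and weights below are illustrative assumptions, not real model values.
PATCH_SIZE = 4   # pixels per flattened patch
VISION_DIM = 4   # output width of the stand-in vision encoder
TEXT_DIM = 6     # embedding width of the stand-in language model

rng = random.Random(0)


def random_matrix(rows, cols):
    """Build a rows x cols matrix of random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a rows x cols matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Stand-ins for independently trained components.
W_VISION = random_matrix(VISION_DIM, PATCH_SIZE)   # "vision encoder"
W_PROJECT = random_matrix(TEXT_DIM, VISION_DIM)    # alignment projection
TEXT_EMBEDDINGS = {}                               # "language model" embedding table


def encode_image(pixels):
    """Split a flat pixel list into patches and embed each patch."""
    patches = [pixels[i:i + PATCH_SIZE] for i in range(0, len(pixels), PATCH_SIZE)]
    return [matvec(W_VISION, patch) for patch in patches]


def project_to_text_space(patch_embeddings):
    """Map vision embeddings into the language model's embedding space."""
    return [matvec(W_PROJECT, emb) for emb in patch_embeddings]


def embed_text(tokens):
    """Look up (or lazily create) an embedding for each text token."""
    for token in tokens:
        if token not in TEXT_EMBEDDINGS:
            TEXT_EMBEDDINGS[token] = [rng.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)]
    return [TEXT_EMBEDDINGS[token] for token in tokens]


def build_input_sequence(pixels, tokens):
    """Prepend projected image embeddings to the text embeddings."""
    visual_tokens = project_to_text_space(encode_image(pixels))
    return visual_tokens + embed_text(tokens)


sequence = build_input_sequence([0.1] * 8, ["what", "is", "this"])
print(len(sequence), "input vectors of width", len(sequence[0]))
```

In this sketch the only point of contact between vision and language is `W_PROJECT`; everything upstream of it was fixed without reference to the language side, which is the separation the paragraph above criticizes.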
