Qwen3-VL Technical Report

TL;DR

Qwen3-VL tackles the challenge of robust long-context multimodal reasoning across text, images, and video by enabling a native 256K-token context window. It uses a three-module vision–language foundation (vision encoder, merger, LLM) enhanced with Interleaved MRoPE for spatial-temporal modeling, DeepStack fusion of multi-layer visual tokens into the LLM, and text-based video timestamps for precise temporal grounding. The model family comprises dense sizes (2B, 4B, 8B, 32B) and MoE variants (30B-A3B, 235B-A22B), trained with staged pretraining (Stage 0: merger-only; Stage 1: end-to-end multimodal pretraining on ~1T tokens) across diverse data (OCR, document parsing, grounding, 3D, STEM, code, video, and agent data) to build both long-context multimodal reasoning and strong language proficiency. Evaluations cover general and multimodal benchmarks, including MMMU and MathVista/MathVision, with ablations attributing gains to the vision encoder and DeepStack, and Needle-in-a-Haystack tests probing long-context retrieval. Together these results position Qwen3-VL as a foundational engine for image-grounded reasoning and multimodal code intelligence, deployable on Alibaba Cloud.
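
The text-based video timestamps mentioned above can be pictured as ordinary text tokens interleaved with each sampled frame's visual tokens. The sketch below is illustrative only: the timestamp string format ("<12.5 seconds>"), the frame-sampling policy, and the `encode_text` callable are assumptions, not the report's exact tokenization scheme.

```python
# Illustrative sketch only: interleaving textual timestamps with per-frame
# visual tokens. The timestamp format and encode_text callable are assumptions.

def build_video_prompt(frame_tokens: list[list[int]],
                       frame_times_s: list[float],
                       encode_text) -> list[int]:
    """Interleave text-encoded timestamps with per-frame visual tokens.

    frame_tokens  : visual token ids for each sampled frame
    frame_times_s : wall-clock time (in seconds) of each sampled frame
    encode_text   : any text tokenizer callable returning a list of token ids
    """
    sequence: list[int] = []
    for tokens, t in zip(frame_tokens, frame_times_s):
        # A plain textual timestamp lets the LLM ground answers such as
        # "at 12.5s the door opens" without a dedicated temporal RoPE axis,
        # in the spirit of the report's move from T-RoPE to textual time alignment.
        sequence.extend(encode_text(f"<{t:.1f} seconds>"))
        sequence.extend(tokens)
    return sequence
```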

Abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
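
To make the interleaved-MRoPE upgrade concrete, here is a minimal sketch of one plausible reading: each token carries a (t, h, w) position triple, and the rotary frequency channels are assigned round-robin across the three axes rather than split into contiguous per-axis blocks, so every axis spans the full frequency spectrum. The round-robin assignment and the function `interleaved_mrope_angles` are assumptions for illustration, not the official implementation.

```python
import torch

def interleaved_mrope_angles(pos_thw: torch.Tensor,
                             head_dim: int,
                             base: float = 10000.0) -> torch.Tensor:
    """pos_thw : (seq_len, 3) integer positions along (t, h, w).
    Returns rotary angles of shape (seq_len, head_dim // 2)."""
    half = head_dim // 2
    # Standard RoPE inverse frequencies over half the head dimension.
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    # Round-robin axis assignment: channel i reads the (i % 3)-th coordinate,
    # so temporal, height, and width positions all receive a balanced mix of
    # high and low frequencies (an assumption about the interleaving scheme).
    axis = torch.arange(half) % 3                 # (half,) values in {0, 1, 2}
    pos = pos_thw.to(torch.float32)[:, axis]      # (seq_len, half)
    return pos * inv_freq                         # (seq_len, half) rotary angles
```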

Paper Structure

This paper contains 68 sections, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: The Qwen3-VL framework integrates a vision encoder and a language model decoder to process multimodal inputs, including text, images, and video. The vision encoder is specifically designed to handle dynamic, native-resolution visual inputs, mapping them to visual tokens of variable length. To enhance perceptual capability and preserve rich visual information, we incorporate the pioneering DeepStack mechanism, which injects visual tokens from multiple layers of the vision encoder into corresponding layers of the LLM. Furthermore, we adopt Interleaved MRoPE to encode positional information for multimodal inputs with a balanced frequency spectrum, and introduce text-based timestamp tokens to more effectively capture the temporal structure of video sequences. (A minimal illustrative sketch of this multi-layer injection appears after this figure list.)
  • Figure 2: Multilingual OCR performance of our model on a self-built test set. The model achieves over 70% accuracy on 32 out of 39 supported languages, demonstrating strong and usable multilingual capabilities.
  • Figure 3: Needle-in-a-Haystack performance heatmap for Qwen3-VL-235B-A22B-Instruct across varying video durations and needle positions. Each cell shows accuracy (%) for locating and answering questions about the inserted "needle" frame.
  • Figure 4:
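
As referenced in the Figure 1 caption, DeepStack injects visual features from multiple vision-encoder layers into corresponding LLM layers. The sketch below shows one way such an injection could look: a residual addition of projected ViT features at the visual-token positions of an LLM layer. The number of tapped levels, the linear projections, and the mapping from ViT levels to LLM layers are illustrative assumptions, not the report's configuration.

```python
import torch

def deepstack_inject(hidden: torch.Tensor,
                     vit_levels: list[torch.Tensor],
                     visual_mask: torch.Tensor,
                     layer_idx: int,
                     projections: torch.nn.ModuleList) -> torch.Tensor:
    """Residually add projected ViT features of one encoder level to the LLM
    hidden states at visual-token positions (a DeepStack-style injection).

    hidden      : (batch, seq_len, d_model) LLM hidden states at this layer
    vit_levels  : per-level ViT features, each (batch, n_visual, d_vit)
    visual_mask : (batch, seq_len) bool mask marking visual-token positions
                  (assumed to select exactly n_visual positions per sample)
    layer_idx   : index of the current LLM layer / tapped ViT level
    projections : one Linear(d_vit, d_model) per tapped level (assumption)
    """
    if layer_idx >= len(vit_levels):
        return hidden  # only the first few LLM layers receive an injection
    injected = projections[layer_idx](vit_levels[layer_idx])  # (batch, n_visual, d_model)
    out = hidden.clone()
    # Add the projected features only where visual tokens sit in the sequence.
    out[visual_mask] = out[visual_mask] + injected.reshape(-1, out.shape[-1])
    return out
```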