V-Thinker: Interactive Thinking with Images

Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Jing Lyu, Honggang Zhang

TL;DR

V-Thinker tackles the challenge of teaching vision-centric, interactive thinking to large multimodal models. It introduces a Data Evolution Flywheel to auto-generate diverse, high-quality interactive data and a Visual Progressive Training Curriculum to align perception with interactive reasoning, complemented by VTBench for expert-verified evaluation. Empirical results show consistent gains over strong baselines in both general and interactive reasoning tasks, and the flywheel demonstrates scalable knowledge expansion across domains. This work provides a data-driven pathway toward robust, image-interactive reasoning in multimodal systems with broad practical implications.

Abstract

Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities is a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.

Paper Structure

This paper contains 66 sections, 10 equations, 12 figures, 17 tables, 1 algorithm.

Figures (12)

  • Figure 1: The three paradigms of vision-centric reasoning.
  • Figure 2: Representative examples of V-Thinker's knowledge-driven synthesis spanning diverse reasoning domains.
  • Figure 3: The rendering process from code to image.
  • Figure 4: The Data Evolution Flywheel framework. Left: knowledge-driven evolution mechanism. Middle: coordinated calibration and progressive expansion stages. Right: representative synthetic QA instances generated through the flywheel.
  • Figure 5: A representative sample from the synthesized dataset V-Interaction-400K ($\mathcal{D}$).
  • ...and 7 more figures