Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang

TL;DR

This work tackles the limitation of static visual representations in multimodal models by introducing Visual Token Scaling with Verification (VTS-V), a framework that enables iterative, verifier-guided visual reasoning at inference time. It casts visual reasoning as a Markov Decision Process with a reasoner that proposes visual actions and a verifier trained via multi-step Direct Preference Optimization to assess action quality and determine when to stop. The authors provide a dedicated VTS dataset with supervised trajectories (VTS-SFT) and preference data (VTS-DPO) to train both components, and demonstrate state-of-the-art performance on challenging benchmarks like BLINK, V$^*$Bench, MMStar, and MathVista. The results show that dynamic, tool-augmented reasoning with a principled verifier yields more accurate and interpretable reasoning traces, and generalizes across both closed- and open-source models, marking a step toward grounded, context-aware visual reasoning in next-generation MLLMs.

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process involving a reasoner that proposes visual actions and a verifier, trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
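
The reasoner/verifier interaction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `State` container, the `run_episode` function, the scalar verifier score, and the threshold-based stopping test are all assumptions made for illustration, and the paper's actual stopping condition (Lemma 3.1) is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class State:
    """Hypothetical MDP state: the question plus the visual actions accepted so far."""
    question: str
    steps: List[str] = field(default_factory=list)


# Hypothetical interfaces (assumptions, not the paper's API):
# the reasoner proposes the next visual action, the verifier scores it.
Reasoner = Callable[[State], str]
Verifier = Callable[[State, str], float]


def run_episode(
    question: str,
    reasoner: Reasoner,
    verifier: Verifier,
    threshold: float = 0.0,  # assumed stopping threshold
    max_steps: int = 8,      # assumed safety cap on reasoning length
) -> State:
    """Propose an action, let the verifier score it, and stop on a low score."""
    state = State(question)
    for _ in range(max_steps):
        action = reasoner(state)
        score = verifier(state, action)
        if score < threshold:
            break  # verifier rejects the proposal: terminate reasoning
        state.steps.append(action)  # accepted action extends the trace
    return state
```

With a toy reasoner that emits numbered placeholder actions and a toy verifier that accepts only the first two proposals, `run_episode` returns a state whose trace contains exactly those two actions.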

Paper Structure

This paper contains 35 sections, 6 theorems, 25 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Lemma 3.1

The reasoner $\texttt{R}_{\theta_0}$ stops at the reasoning step $h$ if

Figures (11)

  • Figure 1: Iterative Visual Reasoning with VTS-V. Our framework equips both open-source and closed-source models with dynamic visual token scaling and step-wise verification to solve complex visual tasks. The example shows how VTS-V: (1) decomposes questions into executable steps, (2) invokes vision tools, and (3) iteratively refines answers via verifier feedback, achieving correct results. In contrast, vanilla models fail to ground detailed visual operations without token scaling, leading to incorrect answers.
  • Figure 2: Pipeline for Synthetic Data Generation and Curation in VTS-V. Our data construction process consists of three stages: (1) generating multi-step reasoning trajectories with visual tool calls, (2) filtering out incorrect trajectories using an LLM-as-a-judge framework, and (3) creating contrastive (correct vs. incorrect) trajectory pairs for multi-step DPO training.
  • Figure 3: Examples of the generated DPO data.
  • Figure 4: Additional examples of the generated DPO data.
  • Figure 5: Additional examples of the generated DPO data.
  • ...and 6 more figures
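
The three curation stages named in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `build_datasets`, `sample_trajectories`, and `judge_is_correct` are hypothetical names, and reusing judge-rejected trajectories as the "incorrect" side of each pair is an assumption, since the caption does not say how the negative trajectories are produced.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]  # hypothetical: a trajectory is a list of step strings


def build_datasets(
    questions: List[str],
    sample_trajectories: Callable[[str], List[Trajectory]],  # stage 1 (assumed)
    judge_is_correct: Callable[[str, Trajectory], bool],     # stage 2 (assumed)
) -> Tuple[List[Tuple[str, Trajectory]], List[Tuple[str, Trajectory, Trajectory]]]:
    """Return (sft_examples, dpo_pairs) built from judged trajectories."""
    sft: List[Tuple[str, Trajectory]] = []
    dpo: List[Tuple[str, Trajectory, Trajectory]] = []
    for q in questions:
        # Stage 1: generate candidate multi-step trajectories with tool calls.
        candidates = sample_trajectories(q)
        # Stage 2: an LLM-as-a-judge stand-in labels each trajectory once.
        labels = [judge_is_correct(q, t) for t in candidates]
        correct = [t for t, ok in zip(candidates, labels) if ok]
        incorrect = [t for t, ok in zip(candidates, labels) if not ok]
        sft.extend((q, t) for t in correct)
        # Stage 3: pair every correct trajectory with every incorrect one
        # (assumption: rejected trajectories serve as the negatives).
        dpo.extend((q, good, bad) for good in correct for bad in incorrect)
    return sft, dpo
```

Each DPO tuple has the form `(question, preferred, rejected)`; a question with no incorrect candidates contributes supervised examples but no preference pairs.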

Theorems & Definitions (12)

  • Lemma 3.1
  • Theorem 3.2: Reasoning steps characterization
  • Proposition A.1
  • Proof: Proof of Proposition \ref{prop:backward induc}
  • Lemma A.2: Proposition 7.16 and Theorem 15.3 of zhang2023mathematical
  • Proof: Proof of Lemma \ref{lem:stopping rule}
  • Definition A.3: Martingale, supermartingale, and submartingale
  • Definition A.4: Stopping time
  • Theorem A.5: Martingale convergence theorem, Theorem 4.2.11 of durrett2019probability
  • Theorem A.6: Formal version of Theorem \ref{thm:finite stopping time}
  • ...and 2 more