
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

Jiazhou Zhou, Yucheng Chen, Hongyang Li, Qing Jiang, Hu Zhou, Ying-Cong Chen, Lei Zhang

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression Module (BCM) establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model's hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, preserving purely end-to-end autoregressive decoding in the latent space without additional inference overhead. Extensive experiments demonstrate the effectiveness of V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
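
The sketch below is not the authors' released code; it is a minimal, hedged illustration of the core mechanism the abstract describes: LLM hidden states projected into dynamic probes that cross-attend over a global visual feature map, returning per-step visual evidence and attention maps (the kind visualized in Figs. 3-5). All class names, dimensions, and hyperparameters are assumptions for illustration.

```python
# Illustrative sketch only (not V-Reflection's actual implementation): hidden
# states act as dynamic probes that interrogate a global visual feature map
# via cross-attention. Names and dimensions are hypothetical.
import torch
import torch.nn as nn


class DynamicProbeAttention(nn.Module):
    """Maps LLM hidden states to queries that retrieve visual evidence."""

    def __init__(self, hidden_dim: int, visual_dim: int, probe_dim: int = 256):
        super().__init__()
        self.to_query = nn.Linear(hidden_dim, probe_dim)   # hidden state -> dynamic probe
        self.to_key = nn.Linear(visual_dim, probe_dim)     # visual patch  -> key
        self.to_value = nn.Linear(visual_dim, hidden_dim)  # visual patch  -> value
        self.scale = probe_dim ** -0.5

    def forward(self, hidden_states: torch.Tensor, visual_feats: torch.Tensor):
        # hidden_states: (B, T, hidden_dim) reasoning steps
        # visual_feats:  (B, N, visual_dim) flattened global feature map
        q = self.to_query(hidden_states)
        k = self.to_key(visual_feats)
        v = self.to_value(visual_feats)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (B, T, N)
        evidence = attn @ v   # visual evidence retrieved per reasoning step
        return evidence, attn # attn can be averaged over steps for visualization


if __name__ == "__main__":
    probe = DynamicProbeAttention(hidden_dim=4096, visual_dim=1024)
    h = torch.randn(1, 8, 4096)      # 8 latent reasoning steps
    vis = torch.randn(1, 576, 1024)  # 24x24 visual patches, flattened
    evidence, attn = probe(h, vis)
    print(evidence.shape, attn.shape)
```

Under the paper's training scheme, such a probe would be supervised by distilling from box-guided (BCM) targets; at inference the auxiliary modules stay inactive and decoding remains purely autoregressive.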

Paper Structure

This paper contains 22 sections, 5 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Conceptual comparison between traditional MLLMs and our V-Reflection. (a) Current MLLMs' reasoning remains confined to the language domain, treating visual information as a reasoning-agnostic static input rather than an active driver of the thought process, leading to perception-related hallucinations (e.g., 'Kevlar') where the model prioritizes language priors over actual visual evidence. (b) Our framework internalizes a "think-then-look" visual self-reflection mechanism, where evolving latent states act as dynamic probes ($\mathbf{Q}_{dyn}$) to retrace global visual features. This mechanism retrieves task-specific evidence (e.g., accurately localizing the rubber glove), effectively correcting the reasoning trajectory toward a precise answer.
  • Figure 2: The V-Reflection Architecture. A two-stage paradigm establishes a "think-then-look" visual self-reflection reasoning mechanism. (a) Stage 1: The Box-Guided Compression Module (BCM) distills regional patches into grounded latent tokens $\mathbf{Z}_T$ via $\mathcal{L}_{BCM}$. (b) Stage 2: The Dynamic Autoregressive Compression (DAC) module distills the spatial expertise of the BCM module, training the LLM's hidden states $\mathbf{H}$ to act as dynamic probes that autonomously interrogate global features. (c) Inference: Both BCM and DAC remain entirely inactive; their spatial grounding has been fully internalized, so the model executes a purely end-to-end visual search driven by its latent reasoning process.
  • Figure 3: Visualization of Latent Reasoning during Training. Averaged attention maps across all reasoning steps demonstrate that while the teacher (BCM) is confined to local priors, the student (DAC) successfully transcends bounding box constraints to capture global contextual relationships.
  • Figure 4: Visualization of Latent Reasoning during Inference. Averaged attention maps across all latent reasoning steps (Col. 3) show the model autonomously pinpointing visual evidence, driven by its latent states.
  • Figure 5: Visualization of Visual Latent Distillation during Training. While the teacher (BCM) is confined to local priors, the student (DAC) successfully transcends bounding box constraints to capture global contextual relationships.
  • ...and 3 more figures