Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song

Abstract

Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on an attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.
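
To make the abstract's notion of "making visual information actionable through information gain" concrete, the following is a minimal sketch, not the paper's definition: a candidate reflection trace is kept only if appending it to the partial reasoning raises the model's log-probability of the reference answer by more than a threshold. The function names (`answer_log_prob`, `select_reflections`) and the threshold are hypothetical, and for brevity the sketch conditions on text only, whereas the actual MLLM setting would also pass the image through the model's processor.

```python
import torch

def answer_log_prob(model, tokenizer, context: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to `answer` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1; score only the answer tokens.
    log_probs = torch.log_softmax(logits[0, ctx_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, ans_ids[0].unsqueeze(1)).sum().item()

def select_reflections(model, tokenizer, trace: str, candidates, answer: str,
                       min_gain: float = 0.5):
    """Keep candidate reflections whose appended text yields a positive
    information gain, measured here (an assumption for illustration) as the
    increase in log p(answer | trace [+ reflection])."""
    base = answer_log_prob(model, tokenizer, trace, answer)
    kept = []
    for reflection in candidates:
        gain = answer_log_prob(model, tokenizer, trace + "\n" + reflection, answer) - base
        if gain > min_gain:
            kept.append((gain, reflection))
    return sorted(kept, reverse=True)  # highest-gain reflections first
```

Under this reading, reflections that merely restate the existing chain score near zero gain and are discarded, while those that surface new, answer-relevant visual evidence survive the filter.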

Paper Structure

This paper contains 43 sections, 5 equations, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overall pipeline of the proposed Visual Re-Examination (VRE) framework. Left: Reflection data construction, where difficulty-aware filtering and information-gain-guided reflection synthesis generate high-quality SFT data. Right: Model optimization, where the cold-start model is iteratively improved through rejection sampling and reinforcement learning under a structured reward scheme comprising format, accuracy, and reflection rewards.
  • Figure 2: Visualizing Implicit Visual Re-Examination via Attention. The heatmaps display attention weights on visual tokens across inference steps. (Left) The Base Model exhibits visual decay, where attention to the image vanishes as the textual context grows, leading to ungrounded hallucinations. (Right) The VRE Model demonstrates a spontaneous attention resurgence. Inside the <reflection> block, the model sharply re-allocates attention back to the visual features, showing that the mechanism actively triggers a re-examination behavior to extract missing visual evidence (a minimal sketch of this measurement follows the figure list).
  • Figure 3: Visualizing the Re-examination Mechanism: From Blindness to Grounding. Phase 1: Initial Reasoning. During the initial pass, the model's attention is dispersed over the background, failing to locate the dustpan. Phase 2: Visual Re-examination. Triggered by the reflection token, the model performs an active visual search. The attention map shows a sharp, targeted focus on the dustpan (red region), successfully retrieving the correct visual evidence to fix the answer.
  • Figure 4: Evolution of Reflection Paradigms. The stacked bar charts detail the distribution of introspection types across training stages.
  • Figure 5: System prompt template for information gain.
  • ...and 10 more figures
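
The measurement behind Figure 2 can be sketched in a model-agnostic way. The snippet below is an illustrative assumption rather than the paper's exact analysis: it consumes the per-step attention tensors returned by a Hugging Face `generate` call with `output_attentions=True, return_dict_in_generate=True`, and it assumes the image tokens occupy a known contiguous span of the prompt (the exact token layout is model-specific).

```python
import torch

def visual_attention_fraction(step_attentions, visual_span):
    """Fraction of attention mass placed on visual tokens at each decoding step.

    step_attentions: sequence over generation steps; each step is a tuple of
        per-layer tensors of shape (batch, heads, q_len, kv_len), as returned
        by `generate(..., output_attentions=True, return_dict_in_generate=True)`.
    visual_span: (start, end) indices of the image tokens in the prompt
        (layout is model-specific; treated as a known assumption here).
    """
    start, end = visual_span
    fractions = []
    for layers in step_attentions:
        # Average over layers and heads; take the attention row of the newest token.
        att = torch.stack([a[0, :, -1, :] for a in layers]).mean(dim=(0, 1))  # (kv_len,)
        fractions.append((att[start:end].sum() / att.sum()).item())
    return fractions
```

Plotting the returned fractions over decoding steps reproduces the qualitative pattern described in Figure 2: a decaying curve for the base model and a resurgence of visual attention inside the <reflection> block for the VRE model.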