MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

TL;DR

Experiments show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

Abstract

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
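
The closed loop described in the abstract (draft, critique, region-based verification, revision) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Critique` and `Trace` types, the `reflect_loop` function, the four callables, the box-style `Region`, and the `max_rounds` cap are all hypothetical names and choices introduced here for illustration. In particular, the abstract says the steps repeat until the output is visually grounded; the round cap is an added assumption so that the toy loop always terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical region format: an (x0, y0, x1, y1) box. The paper's figures
# show points and ellipses; a box is used here only to keep the sketch small.
Region = Tuple[int, int, int, int]


@dataclass
class Critique:
    grounded: bool                   # True if the answer is judged visually grounded
    region: Optional[Region] = None  # region to re-check when not grounded


@dataclass
class Trace:
    answers: List[str] = field(default_factory=list)
    regions: List[Region] = field(default_factory=list)


def reflect_loop(
    draft: Callable[[], str],
    critique: Callable[[str], Critique],
    verify: Callable[[Region], str],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,  # assumption: a cap so the loop always stops
) -> Tuple[str, Trace]:
    """Draft an answer, then alternate critique, region check, and revision."""
    answer = draft()
    trace = Trace(answers=[answer])
    for _ in range(max_rounds):
        c = critique(answer)
        if c.grounded or c.region is None:
            break  # nothing left to verify
        evidence = verify(c.region)        # region-based verification step
        trace.regions.append(c.region)
        answer = revise(answer, evidence)  # revision grounded in the evidence
        trace.answers.append(answer)
    return answer, trace


# Toy stand-ins for the model and the visual tool (hypothetical behaviour):
# the first draft miscounts objects, and the region check reveals the true count.
def toy_draft() -> str:
    return "2"


def toy_critique(answer: str) -> Critique:
    if answer == "3":
        return Critique(grounded=True)
    return Critique(grounded=False, region=(10, 10, 50, 50))


def toy_verify(region: Region) -> str:
    return "3"


def toy_revise(answer: str, evidence: str) -> str:
    return evidence


final, trace = reflect_loop(toy_draft, toy_critique, toy_verify, toy_revise)
print(final, trace.answers)
```

In this toy run the first draft is revised once after a single region check, and the second critique accepts the revised answer. Passing the four stages as callables keeps the control flow separate from any particular model or visual tool.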

Paper Structure (37 sections, 3 equations, 34 figures, 8 tables)

Figures (34)

  • Figure 1: Our visual reflective framework MIRROR improves visual question answering by iteratively verifying evidence in the image and revising the prediction. In each example, Round 1 produces an incorrect answer; the model then reflects with explicit visual grounding (e.g., yellow points or a purple ellipse) to re-check the relevant regions and corrects the response in Round 2, yielding a final accurate answer.
  • Figure 2: MIRROR performs closed-loop visual reflection. The VLM alternates between drafting an answer, reflecting, invoking a visual tool for region-level verification, and revising based on the rendered visual evidence.
  • Figure 3: Overview of the ReflectV dataset construction pipeline. We transform external feedback into self-reflection using Qwen2.5-7B and ensure visual grounding via Molmo-7B and SAM 2.
  • Figure 4: Qualitative examples of iterative visual reflection. Compared to the baseline Qwen2.5-VL and the tool-free variant, MIRROR successfully corrects initial perception and reasoning errors. Top (Spatial Reasoning): The model initially misses the "green cylinder" (Round 1). By triggering the visual prompt generator to mark the neglected object with a blue circle, it successfully recounts and corrects the answer in Round 2. Bottom (Object Identification): Addressing the hallucination where the chair is neglected, the model uses the cyan point to actively verify the visual features, confirming the presence of "the chair".
  • Figure 5: Statistics and distribution of the ReflectV dataset construction. (a) Domain Distribution: The raw data spans four distinct capabilities: General QA, Document Understanding (Doc), Scene Text (OCR), and Chart Reasoning. (b) Filtering Pipeline: Data volume retained across the three construction stages (Original $\rightarrow$ Response-Filtered $\rightarrow$ GT-Filtered), illustrating the rigorous quality control process described in \ref{sec4:data_filter}. (c) Trajectory Depth: The distribution of samples by the number of reflective rounds required for convergence, illustrating the varying complexity of error correction. We set the default mixing ratio to $\rho=0.75$ for our training.
  • ...and 29 more figures