See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Yongchang Zhang; Xianzheng Ma; Tianyi Liu; Guangquan Zhou; Yang Chen

TL;DR

This work presents a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

Abstract

Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to an incorrect final answer. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.
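The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `propose` and `decide`, the top-two margin test used as the uncertainty check, and the toy stand-ins are all assumptions made for the example. Only the threshold name (delta) and the overall flow (decode with the evidence pool, call the visual decider when uncertain, grow the pool) come from the text.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (token, probability) pairs after evidence-based reweighting.
Candidates = List[Tuple[str, float]]


def grounded_decode(
    propose: Callable[[List[str], List[str]], Candidates],
    decide: Callable[[List[str]], Tuple[str, str]],
    delta: float = 0.08,
    max_steps: int = 64,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str], int]:
    """Decode token by token, growing a textual evidence pool when uncertain."""
    prefix: List[str] = []   # tokens committed so far
    pool: List[str] = []     # textual visual evidence collected so far
    calls = 0                # number of visual decider invocations
    for _ in range(max_steps):
        ranked = sorted(propose(prefix, pool), key=lambda c: c[1], reverse=True)
        token, top_p = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        # Assumed uncertainty test: a small margin between the top two candidates.
        if top_p - runner_up < delta:
            token, evidence = decide(prefix)  # the decider looks at the image
            pool.append(evidence)             # the new evidence is reused by later steps
            calls += 1
        prefix.append(token)
        if token == eos:
            break
    return prefix, pool, calls


# Toy stand-ins for the base model plus supervisor and for the visual decider.
def toy_propose(prefix: List[str], pool: List[str]) -> Candidates:
    if len(prefix) == 0:
        return [("The", 0.9), ("A", 0.1)]
    if len(prefix) == 1:
        # Ambiguous colour until some evidence is in the pool.
        return [("red", 0.8), ("blue", 0.2)] if pool else [("red", 0.5), ("blue", 0.5)]
    if len(prefix) == 2:
        return [("car", 0.95), ("bus", 0.05)]
    return [("<eos>", 1.0)]


def toy_decide(prefix: List[str]) -> Tuple[str, str]:
    return "red", "the car in the image is red"


tokens, pool, calls = grounded_decode(toy_propose, toy_decide)
```

In this toy run the decider is invoked once, at the ambiguous colour token, and the evidence it returns stays in the pool for the remaining steps.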

Paper Structure (13 sections, 14 equations, 5 figures, 6 tables)

This paper contains 13 sections, 14 equations, 5 figures, 6 tables.

Introduction
Methodology
Visual Description Grounded Decoding
Distribution Supervisor
Dynamic evidence pool and the visual decider
Experiment
Setups
Benchmarks and base models
Main results and analysis
Ablation study
Efficiency analysis and uncertainty threshold
Qualitative analysis: how ECRD works
Conclusion

Figures (5)

Figure 1: Reasoning pattern comparison. (a) Greedy decoding: the base VLM selects the top-1 token at each step; any hallucination in an intermediate step propagates to an incorrect final answer. (b) RLHF-based “think-with-images”: the model learns when to call tools to zoom or crop the image and re-inject cropped regions into the reasoning context, which is effective but costly and model-specific. (c) Ours: a lightweight, training-free, model-agnostic framework. A supervisor maintains a dynamic visual-evidence pool to detect and correct hallucinated steps. When uncertainty arises, it invokes a visual decider to extract new evidence, enabling visually grounded reasoning throughout the chain.
Figure 2: Overview of evidence-constrained reweighting decoding (ECRD) at decoding step $i$. The base VLM emits a top-k candidate set; the supervisor builds an evidence-induced distribution from the current evidence pool and negotiates with the base probabilities to reweight candidates. If confidence remains low, the visual decider reads the image with the current prefix, commits a token, and adds a short piece of textual evidence to the pool for later steps.
Figure 3: Analysis of the uncertainty threshold $\delta$: accuracy as a function of $\delta$ across five benchmarks, and the average visual decider invocation rate (calls per question) as $\delta$ varies. The gray dashed line marks $\delta=0.08$.
Figure 4: Two typical application cases of ECRD.
Figure 5: Breakdown of ECRD’s gain on TreeBench based on Qwen2.5-VL-7B.
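
The reweighting step summarized in the Figure 2 caption could look something like the sketch below. The exact negotiation rule is not given on this page, so the log-linear blend, the mixing weight `alpha`, and the function name `reweight` are illustrative assumptions rather than the paper's formula.

```python
import math
from typing import Dict


def reweight(
    base: Dict[str, float],
    evidence: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend base and evidence-induced probabilities over the top-k candidates.

    Assumed rule: a log-linear mix with weight `alpha` on the evidence side,
    renormalized over the candidate set. `alpha` is a hypothetical parameter.
    """
    eps = 1e-12  # avoids log(0) for candidates the evidence does not support
    scores = {
        tok: (1.0 - alpha) * math.log(p + eps)
        + alpha * math.log(evidence.get(tok, 0.0) + eps)
        for tok, p in base.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(unnorm.values())
    return {tok: v / total for tok, v in unnorm.items()}


# The base model slightly prefers "blue", while the evidence pool supports "red".
base_probs = {"blue": 0.6, "red": 0.4}
evidence_probs = {"blue": 0.1, "red": 0.9}
mixed = reweight(base_probs, evidence_probs)
best = max(mixed, key=mixed.get)
```

With these toy numbers the evidence overturns the base model's preference and "red" becomes the top candidate. Under this assumed rule, a larger `alpha` gives the evidence pool more say, and `alpha = 0` recovers the renormalized base probabilities.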
