DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

Yangfu Li, Hongjian Zhan, Jiawei Chen, Yuning Gong, Qi Liu, Yue Lu

TL;DR

DeepScan is a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). It significantly boosts LVLMs on diverse visual tasks, especially fine-grained visual understanding.

Abstract

Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impact of distracting context. Refocusing then optimizes the localized evidence view through collaboration between LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs on diverse visual tasks, especially fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.
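
To make the three-stage flow in the abstract concrete, the following is a minimal control-flow sketch, not the authors' implementation. It makes several simplifying assumptions: evidence is represented as axis-aligned boxes rather than the point-based proxies and masks the paper describes, cue patches are merged by a plain bounding-box union, and the names `deepscan_sketch`, `EvidenceMemory`, `has_cue`, `refocus`, and `answer` are hypothetical stand-ins for the LVLM and visual-expert calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class EvidenceMemory:
    """Hypothetical store of labeled views at several granularities."""
    views: List[Tuple[str, Box]] = field(default_factory=list)

    def add(self, label: str, box: Box) -> None:
        self.views.append((label, box))


def split_into_patches(width: int, height: int, patch: int) -> List[Box]:
    """Tile the image into non-overlapping patches, row by row."""
    return [
        (left, top, min(left + patch, width), min(top + patch, height))
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def union(boxes: List[Box]) -> Box:
    """Smallest box covering all inputs (an assumed stand-in for evidence fusion)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def deepscan_sketch(
    width: int,
    height: int,
    patch: int,
    has_cue: Callable[[Box], bool],          # assumed: LVLM judges a patch
    refocus: Callable[[Box], Box],           # assumed: adjusts the evidence view
    answer: Callable[[EvidenceMemory], object],  # assumed: final LVLM call
) -> object:
    memory = EvidenceMemory()
    memory.add("global", (0, 0, width, height))

    # Stage 1 (simplified Hierarchical Scanning): search patches for local
    # cues, then merge the patches that contain a cue into one region.
    cue_patches = [b for b in split_into_patches(width, height, patch) if has_cue(b)]

    if cue_patches:
        # Stage 2 (simplified Refocusing): refine the merged evidence view.
        memory.add("evidence", refocus(union(cue_patches)))
        for box in cue_patches:
            memory.add("cue", box)

    # Stage 3 (simplified Evidence-Enhanced Reasoning): answer from all views.
    return answer(memory)
```

With real models, `has_cue`, `refocus`, and `answer` would wrap LVLM or visual-expert calls; here they are plain callables so the flow can be exercised with toy functions.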

Paper Structure (21 sections, 11 equations, 17 figures, 10 tables, 2 algorithms)

Figures (17)

  • Figure 1: Performance of LVLMs and visually grounded reasoning variants on V*. DeepScan achieves highly competitive results.
  • Figure 2: Overall architecture of DeepScan: Hierarchical Scanning progressively recovers visual evidence from in-patch cues formulated as point-based proxies using Local Cue Exploration and Multi-Scale Evidence Extraction, e.g., $c^1_{T} \mapsto e_{t-1}$ in step $T$; Refocusing further refines the surrounding context for the fused evidence via interactions between LVLMs and visual experts; Evidence-Enhanced Reasoning leverages a Hybrid Evidence Memory to provide multi-granular information to the LVLM, enabling detailed yet comprehensive answers.
  • Figure 3: Illustration of morphological post-processing. + marks the point-based proxies; $m$ and $m^+$ denote evidence masks before and after post-processing, showing improved robustness.
  • Figure 4: Analysis of (left) performance gain w.r.t. target area ratio and (right) performance-latency trade-off on V*, where the $\infty$ mark denotes the case without explicit truncation of the candidate count $k$.
  • Figure 5: Illustration of Refocusing, where $V^*$ denotes its results. Refocusing recalibrates proxy misalignment by adaptively completing (left) or further amplifying (right) the evidence.
  • ...and 12 more figures