Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif, Juena Ahmed Noshin, Md Ashikur Rahman

TL;DR

Spatial Credit Redistribution (SCR) is a training-free, inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. Its gains are largest on low-entropy inputs, consistent with the paper's theoretical framework.

Abstract

Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrates on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model configurations across three families (Chameleon, LLaVA, and Qwen, the last including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on the POPE and CHAIR benchmarks. SCR reduces hallucination rate by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative) and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit collapse as the key driver.
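To make the notion of "low-entropy inputs" concrete, the following is a minimal sketch that scores how concentrated an attention distribution over image patches is, using Shannon entropy. It is an illustration under stated assumptions rather than the paper's implementation: the function names `attention_entropy` and `top_k_sources`, the natural-log base, and the use of a plain top-k rule for picking source patches are all hypothetical, since the abstract does not specify them.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats; the base is an assumption) of attention weights.

    The weights are normalized to sum to 1 first. Low entropy means the
    attention mass sits on a few patches, i.e. the "collapsed" case.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_k_sources(weights, k):
    """Indices of the k highest-attention patches (hypothetical selection rule)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]

# Uniform attention over 16 patches: entropy equals log(16), the maximum.
uniform = [1.0] * 16
# Collapsed attention: a single patch holds most of the mass.
collapsed = [0.85] + [0.01] * 15

h_uniform = attention_entropy(uniform)
h_collapsed = attention_entropy(collapsed)
sources = top_k_sources(collapsed, k=1)
```

In this toy setting the collapsed distribution scores far below the uniform one, and the single dominant patch is the one selected as a source. The random-selection ablation described in the abstract would correspond to replacing `top_k_sources` with a uniform random draw of indices.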

Paper Structure (17 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: SCR methodology overview. (a) Full two-pass inference pipeline. (b) Spatial credit redistribution via 8-connected neighborhood structure.
  • Figure 2: SCR performance across six VLM configurations (Cham-7B/30B, LLaVA-1.5-7B/13B, Qwen-VL, Qwen2-VL-7B) on POPE-Adversarial. SCR reduces hallucination rate by $\approx$4.7-6.0 pp on every tested configuration, covering both early-fusion (Chameleon) and connector-based (LLaVA, Qwen) models.
  • Figure 3: Entropy-stratified HR reduction across all six model configurations. SCR provides the greatest benefit for low-entropy inputs ($H < 3.5$, avg. 9.8 pp reduction) where credit is most collapsed, and diminishing gains for already-distributed credit ($H > 4.5$, avg. 2.4 pp), validating our theoretical framework.
  • Figure 4: Ablation studies. (b) Neighbor topology comparison. (c) Semantic structure vs. entropy control. (d) Head-selectivity and computational cost trade-off. See ablation (a) in text for the Uniform-Smooth attention-guidance comparison (Tables \ref{tab:main_results}-\ref{tab:main_results_13b}).
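
The 8-connected neighborhood mentioned in the Figure 1(b) caption can be sketched on a toy patch grid. This is a simplified illustration, not the authors' method: it moves a scalar per-patch value rather than hidden-state vectors, and the names `neighbors_8` and `redistribute`, the `fraction` parameter, the equal split among neighbors, and the conservation of total mass are all assumptions made for the example.

```python
def neighbors_8(row, col, height, width):
    """In-bounds 8-connected neighbors of (row, col) on a height x width grid."""
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the patch itself
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                result.append((r, c))
    return result

def redistribute(grid, sources, fraction=0.5):
    """Move `fraction` of each source patch's value equally to its neighbors.

    `grid` is a list of rows of floats standing in for per-patch activation
    magnitude. The input is left unmodified and the grid total is conserved.
    """
    height, width = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for (r, c) in sources:
        nbrs = neighbors_8(r, c, height, width)
        if not nbrs:
            continue  # a 1x1 grid has nowhere to send credit
        moved = fraction * grid[r][c]
        out[r][c] -= moved
        share = moved / len(nbrs)
        for (nr, nc) in nbrs:
            out[nr][nc] += share
    return out

# A 3x3 grid with all credit on the center patch.
grid = [[0.0, 0.0, 0.0],
        [0.0, 8.0, 0.0],
        [0.0, 0.0, 0.0]]
result = redistribute(grid, sources=[(1, 1)], fraction=0.5)
```

With `fraction=0.5`, the center patch keeps half of its value and each of its eight neighbors receives an equal share of the rest. Corner and edge patches have fewer in-bounds neighbors (three and five respectively), so their shares are larger under this equal-split assumption.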