Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

Sangmin Woo; Donguk Kim; Jaehyuk Jang; Yubin Choi; Changick Kim

TL;DR

The paper identifies blind tokens (image patches that attract excessive attention yet contribute little to the final prediction) as a key source of hallucinations in large vision-language models. It introduces AvisC, a training-free, test-time decoding method that detects blind tokens via layer-wise attention and mitigates their influence with a contrastive decoding step, without altering model parameters. Across the POPE, MME, and AMBER benchmarks, AvisC consistently reduces hallucinations and improves factual grounding and descriptive quality for different LVLMs, demonstrating a practical, model-agnostic plug-in approach. The work highlights the importance of attention calibration at decoding time and lays groundwork for further exploration of attention-driven artifacts in multimodal transformers.

Abstract

Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations, attributing incorrect or misleading features to images. We observe that LVLMs disproportionately focus on a small subset of image tokens--termed blind tokens--which are typically irrelevant to the query (e.g., background or non-object regions). We hypothesize that such attention misalignment plays a key role in generating hallucinated responses. To mitigate this issue, we propose Attentional Vision Calibration (AvisC), a test-time approach that dynamically recalibrates the influence of blind tokens without modifying the underlying attention mechanism. AvisC first identifies blind tokens by analyzing layer-wise attention distributions over image tokens, then employs a contrastive decoding strategy to balance the influence of original and blind-token-biased logits. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
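
The contrastive decoding step mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the combination rule below, `(1 + alpha) * original - alpha * biased`, is an assumption borrowed from the common contrastive decoding form; the function name `contrastive_logits`, the weight `alpha`, and the toy logit values are all hypothetical.

```python
import numpy as np

def contrastive_logits(logits_orig, logits_blind, alpha=1.0):
    """Combine two sets of next-token logits (assumed form, not the paper's code).

    logits_orig:  logits from the unmodified forward pass
    logits_blind: logits from a pass biased toward the blind tokens
    alpha:        hypothetical weight controlling how strongly the
                  blind-token-biased logits are subtracted
    """
    logits_orig = np.asarray(logits_orig, dtype=float)
    logits_blind = np.asarray(logits_blind, dtype=float)
    return (1.0 + alpha) * logits_orig - alpha * logits_blind

# Toy vocabulary of three tokens (values are illustrative only).
logits_orig = np.array([2.0, 1.0, 0.5])
logits_blind = np.array([0.5, 1.5, 0.5])

adjusted = contrastive_logits(logits_orig, logits_blind, alpha=1.0)
next_token = int(np.argmax(adjusted))  # greedy pick from the adjusted logits
```

In this toy case, the token that the blind-token-biased pass favors (index 1) is pushed down, while the token supported mainly by the original pass (index 0) is pushed up.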

Paper Structure (60 sections, 19 equations, 16 figures, 15 tables)

Introduction
Observations
Blind tokens in uniform images.
Mismatch between blind tokens and actual objects.
Zero‐out experiments.
Attention bias and hallucinations.
Hypothesis.
Approach: AvisC
LVLM Framework
Uni-modal encoding.
Cross-modal alignment.
Next token prediction via LLM.
Attentional Vision Calibration
Layer selection.
Blind token identification.
...and 45 more sections

Figures (16)

Figure 1: Blind tokens in LVLMs. (Top) Even when the image ($\mathcal{V}$) lacks meaningful content for the textual query ($\mathcal{Q}$), modern LVLMs [dai2024instructblip, liu2023visual] still assign disproportionate attention to a few image tokens (i.e., blind tokens). Although all patches are identical and featureless yellow, these tokens dominate the attention distribution. (Bottom) In a real image, overlaying bounding boxes and LLaVA 1.5's attention map highlights a clear mismatch between blind tokens (red boxes) and genuinely informative regions. Note: attention weights are averaged across all layers for the first generated token. See the appendix sections on attention bias and blind-token statistics for more examples.
Figure 2: Blind tokens contribute little to actual predictions. (a) We perform zero-out experiments to measure the impact of blind vs. non-blind tokens. Zeroing out blind tokens (Zero-out > $\mu + \sigma$), i.e., tokens whose attention weight exceeds the mean plus one standard deviation, leaves the model’s predicted probabilities nearly unchanged, suggesting that these tokens carry minimal object-discriminative information. In contrast, zeroing out non-blind tokens yields near 50:50 probabilities, underscoring their critical role in correct prediction. (b) When non-blind tokens are zeroed out, the model fails to correctly predict previously well-classified instances.
Figure 3: An overview of AvisC.
Figure 4: Layer-wise image attention proportion in LVLMs [liu2023improved, dai2024instructblip]. This shows the proportion of attention given to image tokens at each layer relative to total attention. Different layers exhibit distinct attention patterns, which vary across models. Attention weights are averaged over 60 questions from LLaVA-Bench [liu2023visual].
Figure 5: Performance comparison on MME-Fullset. AvisC achieves top performance in 7 of 14 categories with InstructBLIP [dai2024instructblip] and in 11 categories with LLaVA-1.5 [liu2023visual]. Beyond minimizing hallucinations, AvisC also boosts the general functionality of LVLMs.
...and 11 more figures
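
Two quantities named in the captions above, the $\mu + \sigma$ threshold from Figure 2 and the layer-wise image attention proportion from Figure 4, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, and the toy attention values are hypothetical, and the attention weights are assumed to be already extracted from the model.

```python
import numpy as np

def image_attention_proportion(attn, image_idx):
    """Per-layer share of attention that falls on image tokens.

    attn:      array of shape (num_layers, seq_len), attention weights
               from the current query position to every input token
    image_idx: positions of the image tokens in the sequence
    """
    attn = np.asarray(attn, dtype=float)
    return attn[:, image_idx].sum(axis=1) / attn.sum(axis=1)

def blind_token_mask(image_attn):
    """Flag image tokens whose attention weight exceeds mean + std."""
    image_attn = np.asarray(image_attn, dtype=float)
    threshold = image_attn.mean() + image_attn.std()
    return image_attn > threshold

# Toy example: 2 layers, 4 tokens, image tokens at positions 1 and 2.
attn = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.25, 0.25, 0.25, 0.25]])
proportion = image_attention_proportion(attn, [1, 2])

# Toy example: five image tokens, one of which dominates the attention.
image_attn = np.array([0.05, 0.05, 0.05, 0.05, 0.8])
mask = blind_token_mask(image_attn)
```

Here `mask` marks only the last token, the one whose weight sits well above the rest, which mirrors the "few tokens dominate" pattern described in Figure 1.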
