SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation

Tripti Shukla, Zsolt Kira

Abstract

Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process, where hallucinations actually arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during generation. Our key observation is that hallucinations are strongly correlated with attention sink tokens: punctuation or function tokens that accumulate disproportionate attention despite carrying limited semantic content. SAGE leverages these tokens as anchors to monitor grounding reliability in real time. At each sink trigger, the method extracts semantic concepts from the generated sequence, estimates their visual grounding using both self-attention maps and gradient-based attribution, and measures their spatial agreement. Based on this signal, self-attention distributions are adaptively sharpened or broadened to reinforce grounded regions or suppress unreliable ones. Extensive experiments across diverse hallucination benchmarks demonstrate that SAGE consistently outperforms existing decoding strategies, achieving substantial reductions in hallucination while preserving descriptive coverage, without requiring model retraining or architectural modifications. Our method achieves an average relative improvement of 10.65% on MSCOCO and 7.19% on AMBER across diverse VLM architectures.
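
To make the pipeline concrete, below is a minimal, self-contained Python sketch of the sink-triggered decoding loop the abstract describes. Every name here (ToyVLM, SINKS, TAU, extract_concepts, ground_by_attention, ground_by_gradcam, overlap, modulate) is an illustrative assumption standing in for model-specific machinery, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of a SAGE-style loop: detect a sink token, compare an
# attention-based grounding map (i1) with a Grad-CAM-style map (i2) via an
# overlap score o, then sharpen or diffuse attention around a threshold tau.
SINKS = {".", ",", "and", "of", "the"}   # assumed sink-token vocabulary
TAU = 0.3                                # assumed overlap threshold tau

class ToyVLM:
    """Stand-in for a VLM that exposes per-step attention over image patches."""
    rng = np.random.default_rng(0)
    vocab = ["a", "dog", "on", "grass", ".", "the", "sky"]

    def step(self, tokens, image):
        token = self.rng.choice(self.vocab)       # next token (random toy)
        attn = self.rng.random((7, 7))            # 7x7 image-patch attention
        return token, attn / attn.sum()

    def set_attention_bias(self, attn):
        pass  # a real model would inject this into subsequent self-attention

def extract_concepts(tokens):
    return [t for t in tokens if t not in SINKS]  # crude concept filter

def ground_by_attention(attn):
    return attn                                   # i1: sink-token attention map

def ground_by_gradcam(model, image, concepts):
    return np.ones_like(image) / image.size       # i2: placeholder Grad-CAM map

def overlap(i1, i2, q=0.5):
    """Spatial agreement o as IoU of the two binarized grounding maps."""
    a, b = i1 >= q * i1.max(), i2 >= q * i2.max()
    return np.logical_and(a, b).sum() / max(np.logical_or(a, b).sum(), 1)

def modulate(attn, o, tau):
    """Sharpen attention when grounding agrees (o >= tau), else diffuse it."""
    temp = 2.0 if o >= tau else 0.5
    w = attn ** temp
    return w / w.sum()

def sage_decode(model, image, max_len=20):
    tokens = []
    for _ in range(max_len):
        token, attn = model.step(tokens, image)
        tokens.append(token)
        if token not in SINKS:
            continue                              # act only at sink triggers
        concepts = extract_concepts(tokens)       # Stage 1: concepts C
        i1 = ground_by_attention(attn)            # Stage 2: attention map
        i2 = ground_by_gradcam(model, image, concepts)
        o = overlap(i1, i2)                       # spatial agreement score
        model.set_attention_bias(modulate(attn, o, TAU))  # Stage 3
    return tokens

print(" ".join(sage_decode(ToyVLM(), np.ones((7, 7)))))
```

In a real VLM, set_attention_bias would rescale self-attention logits over image tokens at later decoding steps, and ground_by_gradcam would backpropagate concept scores through the vision encoder's final layer; both are placeholders here so the sketch runs as-is.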

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: We propose SAGE, a sink-aware grounded decoding strategy for mitigating hallucinations in VLMs. SAGE detects sink tokens during generation and dynamically recalibrates attention towards image tokens to produce visually grounded descriptions.
  • Figure 2: Overview of our inference-time decoding strategy, SAGE. Stage 1: SAGE identifies attention sink tokens and extracts key semantic concepts $C$ from the partially generated output. Stage 2: We project the sink-token attention onto the image to obtain an attention-based grounding map $i_1$, and compute concept-level visual grounding $i_2$ using Grad-CAM from the final layer of the vision encoder. An overlap score $o$ is then calculated between the two maps. Stage 3: Based on a threshold $\tau$, SAGE adaptively modulates attention: when $o < \tau$, attention is diffused to explore alternative image regions; otherwise, attention is reinforced to strengthen grounding in the same regions.
  • Figure 3: (a) Layer-wise analysis showing the average IoU between spatial self-attention maps of ground-truth object tokens and their corresponding bounding boxes, alongside attention entropy. Results are averaged over $500$ images from MSCOCO. (b) Grad-CAM visualizations from the final layer of the vision encoder for individual concepts and their union. (A minimal sketch of this IoU/entropy computation follows the figure list.)
  • Figure 4: Qualitative results of SAGE, including sink-token spatial attention, GradCAM-derived visual grounding maps, the computed overlap score $o$, and the modulated spatial attention maps. The sink token is highlighted in red.
  • Figure 5: Qualitative results of SAGE, including sink-token spatial attention, GradCAM-derived visual grounding maps, the computed overlap score $o$, and the modulated spatial attention maps. The sink token is highlighted in red.
  • ...and 1 more figure
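
Complementing the Figure 3(a) diagnostic, the sketch below shows one plausible way to compute the IoU between a token's spatial attention map (binarized at a fraction of its maximum) and a ground-truth bounding box, together with the attention entropy that indicates how diffuse the map is. The grid size, thresholding rule, and patch-coordinate box format are assumptions, not the paper's exact protocol.

```python
import numpy as np

def attn_box_iou(attn, box, q=0.5):
    """IoU between a thresholded attention map and a ground-truth box.
    attn: (H, W) attention over image patches; box: (x0, y0, x1, y1) in
    patch coordinates, half-open. Thresholding at q * max is an assumption."""
    mask = np.zeros_like(attn, dtype=bool)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = True                  # rasterize the box
    hot = attn >= q * attn.max()               # binarize the attention map
    inter = np.logical_and(hot, mask).sum()
    union = np.logical_or(hot, mask).sum()
    return inter / max(union, 1)

def attn_entropy(attn):
    """Shannon entropy of the normalized attention distribution
    (higher = more diffuse, as plotted alongside IoU in Figure 3a)."""
    p = attn.flatten() / attn.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

attn = np.random.default_rng(1).random((24, 24))   # toy 24x24 patch grid
print(attn_box_iou(attn, (4, 4, 12, 12)), attn_entropy(attn))
```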