Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression

Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu

TL;DR

This work identifies image-inattentive attention heads as a key driver of hallucinations in vision-language models and introduces SPIN, a training-free, inference-time head suppression method. For each query token, SPIN dynamically builds a head mask that preserves the top-$k$ vision-attentive heads and scales the rest by a suppression factor $\alpha$, yielding the masked multi-head attention $\text{MHA}_{Q,K,V,m} = \left( \bigoplus_{i=1}^{H} (m_i \cdot h_i) \right) W_o$, where $h_i$ is the output of head $i$, $m_i$ is its mask value, $H$ is the number of heads, and $W_o$ is the output projection. Across multiple LVLMs and decoding strategies, SPIN reduces hallucination scores by up to $2.7\times$ on CHAIR and related metrics while maintaining F1 and improving throughput by up to $1.8\times$ over existing alternatives, demonstrating a practical, low-latency path to better grounding in multimodal systems. The suppression parameters are chosen through systematic ablations, which show that selecting heads by their attention to image tokens is more effective than relying on text-only cues. The method applies to a broad range of architectures whose weights are accessible. Limitations include reduced effectiveness under stochastic decoding such as nucleus sampling and the need for access to model weights; future work could explore adaptive routers and broader evaluation on API-restricted models.

Abstract

Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from "hallucinations", generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
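The head-selection rule described above can be illustrated with a minimal pure-Python sketch for a single query token. It is not the authors' implementation: the function names `spin_head_mask` and `masked_concat` are hypothetical, and scoring a head by the sum of its attention weights over image-token positions is an assumption about how "attention to image tokens" is measured. The sketch stops at the masked concatenation $\bigoplus_i (m_i \cdot h_i)$ and omits the output projection $W_o$.

```python
from typing import List, Sequence


def spin_head_mask(attn: Sequence[Sequence[float]],
                   image_positions: Sequence[int],
                   k: int,
                   alpha: float) -> List[float]:
    """Build a per-head mask for one query token.

    attn[i][j] is head i's attention weight from the current query token
    to key position j. Heads in the top-k by image attention get mask 1.0;
    all other heads get the suppression factor alpha.
    """
    # Assumed score: total attention mass the head places on image tokens.
    scores = [sum(row[j] for j in image_positions) for row in attn]
    # Stable sort by descending score, so ties go to the lower head index.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = set(ranked[:k])
    return [1.0 if i in keep else alpha for i in range(len(scores))]


def masked_concat(head_outputs: Sequence[Sequence[float]],
                  mask: Sequence[float]) -> List[float]:
    """Scale each head output h_i by m_i and concatenate (W_o omitted)."""
    out: List[float] = []
    for m, h in zip(mask, head_outputs):
        out.extend(m * x for x in h)
    return out


# Toy example: 4 heads, 3 key positions, positions 0 and 1 are image tokens.
attn = [
    [0.10, 0.10, 0.80],  # head 0: image attention 0.20
    [0.40, 0.40, 0.20],  # head 1: image attention 0.80
    [0.30, 0.20, 0.50],  # head 2: image attention 0.50
    [0.00, 0.05, 0.95],  # head 3: image attention 0.05
]
head_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

mask = spin_head_mask(attn, image_positions=[0, 1], k=2, alpha=0.5)
out = masked_concat(head_outputs, mask)
# mask -> [0.5, 1.0, 1.0, 0.5]
# out  -> [0.5, 1.0, 3.0, 4.0, 5.0, 6.0, 3.5, 4.0]
```

In this toy setup heads 1 and 2 attend most to the image positions and are kept intact, while heads 0 and 3 are scaled by $\alpha = 0.5$. Setting `alpha=0.0` would prune the low-scoring heads entirely. Because the mask is recomputed from the attention weights of each query token, the set of suppressed heads can change from token to token and layer to layer.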

Paper Structure

This paper contains 24 sections, 5 equations, 12 figures, 17 tables.

Figures (12)

  • Figure 1: Caption generation using LLaVA-1.5 and SPIN. LLaVA-1.5's generated text description mentions a "chair" in the background, which is clearly a hallucinated object. SPIN mitigates hallucination while successfully identifying the objects present in the image.
  • Figure 2: Average attention allocated by the current token to the preceding vision and text tokens in CHAIR. Image tokens receive $<10\%$ of total attention from layer 3, while constituting $\sim 76$-$92\%$ of the input.
  • Figure 3: CHAIR scores for SPIN compared with existing approaches for greedy, beam-search, and nucleus sampling based decoding on LLaVA-1.5 (7B) (top) and Qwen-VL (bottom).
  • Figure 4: MMHal-Bench evaluation on LLaVA-1.5 (13B), MiniGPT-4 and Shikra.
  • Figure 5: Ablation on the suppression factor ($\alpha$) for four models.
  • ...and 7 more figures