Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression
Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu
TL;DR
This work identifies image-inattentive attention heads as a key driver of hallucinations in vision-language models and introduces SPIN, a training-free, inference-time head suppression method. For each query token, SPIN builds a head mask $m$ that preserves the top-$k$ vision-attentive heads and scales the remaining heads by a suppression factor $\alpha$, giving the multi-head attention form $\text{MHA}_{Q,K,V,m} = \left( \bigoplus_{i=1}^{H} (m_i \cdot h_i) \right) W_o$, where $h_i$ is the output of head $i$, $H$ is the number of heads, $\bigoplus$ denotes concatenation, and $W_o$ is the output projection. Across multiple LVLMs and decoding strategies, SPIN reduces hallucination scores such as CHAIR by up to $2.7\times$ while maintaining F1, and improves throughput by up to $1.8\times$ compared to existing alternatives, offering a practical, low-latency path to better grounding in multimodal systems. The pruning parameters are selected through systematic ablations, which show that suppression is more effective when it is guided by attention to image tokens rather than by text-only cues. The method applies to any architecture whose weights are accessible. Limitations include reduced effectiveness under stochastic decoding such as nucleus sampling and the need for access to model weights; future work could explore adaptive routers and broader evaluation on API-restricted models.
Abstract
Despite their remarkable progress in multimodal understanding tasks, large vision-language models (LVLMs) often suffer from "hallucinations", generating text that is misaligned with the visual context. Existing methods that reduce hallucinations through inference-time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic, attention-guided head suppression strategy that can be seamlessly integrated during inference without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-$k$ attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores by up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.
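
As an illustration of the per-token head suppression described above, the following is a minimal dependency-free sketch for a single query token in a single layer. It is not the authors' implementation (see the linked repository for that). It assumes that a head's attention to the image is measured as the total attention weight it places on image-token positions, and the function names `spin_head_mask` and `masked_mha`, as well as the toy values, are hypothetical.

```python
def spin_head_mask(attn, image_positions, k, alpha):
    """Build the head mask m for one query token.

    attn[h][j] is the attention weight head h assigns to key position j.
    Assumption: heads are ranked by their summed attention over image positions.
    The k highest-ranked heads get m_i = 1.0; all others get m_i = alpha.
    """
    image_mass = [sum(row[j] for j in image_positions) for row in attn]
    ranked = sorted(range(len(attn)), key=lambda h: image_mass[h], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if h in keep else alpha for h in range(len(attn))]


def masked_mha(head_outputs, mask, w_o):
    """Compute (concat_i m_i * h_i) W_o for one query token.

    head_outputs[h] is the output vector of head h (length d_head).
    w_o is a nested list with H * d_head rows and d_model columns.
    """
    concat = [m * x for m, head in zip(mask, head_outputs) for x in head]
    d_model = len(w_o[0])
    return [sum(c * row[j] for c, row in zip(concat, w_o)) for j in range(d_model)]


# Toy example: 3 heads, 4 key positions, of which positions 0 and 1 are image tokens.
attn = [
    [0.4, 0.4, 0.1, 0.1],  # head 0: mostly attends to the image
    [0.1, 0.1, 0.4, 0.4],  # head 1: mostly attends to text
    [0.3, 0.2, 0.3, 0.2],  # head 2: mixed
]
head_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
identity = [[1.0 if r == c else 0.0 for c in range(6)] for r in range(6)]

mask = spin_head_mask(attn, image_positions=[0, 1], k=2, alpha=0.0)
output = masked_mha(head_outputs, mask, identity)
```

With $\alpha = 0$ the image-inattentive head is removed entirely, while $0 < \alpha < 1$ only attenuates it. In a real model the mask would be recomputed for every query token and every layer, since the set of image-inattentive heads is dynamic.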
