Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration

Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu

TL;DR

This work identifies Spatial Perception Bias in vision-token attention as a core driver of object hallucination in Large Vision-Language Models. It introduces two attention-calibration strategies: Uniform Attention Calibration (UAC), a training-free method that computes a calibration matrix $W$ to enforce uniform attention with $A'_{\text{img}} = W \circ A_{\text{img}}$, and Dynamic Attention Calibration (DAC), a plug-and-play module that learns to adjust $A'_{\text{img}} = f(A_{\text{img}})$ using a contrastive loss together with cross-entropy, optimized on augmented object-crop data via $\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{CL}$. Evaluations across LVLMs (e.g., LLaVA-1.5, mPLUG-Owl2, LLaVA-NeXT) and benchmarks (POPE, CHAIR, MME, LLaVA-Bench) show that UAC and DAC substantially reduce object hallucination and improve general multimodal alignment, achieving state-of-the-art results with minimal overhead. The results demonstrate strong improvements in both structured and open-ended tasks, with DAC providing the best overall gains while UAC offers a cost-efficient alternative. Limitations include sensitivity to validation data availability and the potential need for data-free extensions for calibration in low-resource settings.
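The UAC step $A'_{\text{img}} = W \circ A_{\text{img}}$ can be illustrated with a minimal sketch. It assumes that $W$ is built from a reference attention map measured on a meaningless image, as the ratio of the mean attention to the per-position attention, so that the calibrated reference map becomes uniform. The function names, the `eps` guard, and the optional renormalization are illustrative assumptions, not the authors' implementation.

```python
def uniform_calibration_matrix(ref_attention, eps=1e-12):
    """Build W so that W ∘ ref_attention is constant over image-token positions.

    ref_attention: 2D list of attention weights over the image-token grid,
    measured on a meaningless (e.g. blank) reference image. The ratio form
    used here is an assumption made for illustration.
    """
    flat = [a for row in ref_attention for a in row]
    target = sum(flat) / len(flat)  # uniform level: the mean attention
    return [[target / max(a, eps) for a in row] for row in ref_attention]


def apply_calibration(weights, attention, renormalize=True):
    """Compute A' = W ∘ A (element-wise product).

    With renormalize=True (an assumption), A' is rescaled so the total
    attention mass on image tokens matches the original total.
    """
    calibrated = [
        [w * a for w, a in zip(w_row, a_row)]
        for w_row, a_row in zip(weights, attention)
    ]
    if renormalize:
        total = sum(a for row in attention for a in row)
        new_total = sum(a for row in calibrated for a in row)
        if new_total > 0:
            scale = total / new_total
            calibrated = [[a * scale for a in row] for row in calibrated]
    return calibrated


# Toy 2x2 reference map with attention growing in raster-scan order.
reference = [[0.01, 0.02], [0.03, 0.04]]
W = uniform_calibration_matrix(reference)

# Calibrating the reference map itself yields a flat map.
flat_map = apply_calibration(W, reference, renormalize=False)

# Calibrating the attention map of some other (hypothetical) image.
image_attention = [[0.02, 0.02], [0.03, 0.03]]
calibrated = apply_calibration(W, image_attention)
```

In this toy case, positions that the reference map under-attends (the top-left cell) are boosted, and over-attended positions are damped, which is the intended effect of a position-wise calibration matrix.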

Abstract

Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs in which the vision-token attention map has a fixed correlation with spatial position, and propose to mitigate it by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this issue, we first introduce a training-free solution, Uniform Attention Calibration (UAC), that estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax the assumption of a single meaningless input in UAC and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), that enforces consistent outputs wherever the object is located in the image via a plug-and-play module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.
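The DAC training objective combines cross-entropy with a contrastive term weighted by $\lambda$, i.e. $\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{CL}$. The sketch below shows how such a combined loss could be computed. The source does not specify the form of the contrastive term here, so the InfoNCE-style loss, the cosine similarity, the `temperature` value, and all function names are assumptions chosen for illustration; the positive pair stands in for representations of the same object placed at two different image positions.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over logits, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the source).

    anchor / positive: representations of the same object at two positions.
    negatives: representations that should not match the anchor.
    """
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    return cross_entropy([s / temperature for s in sims], 0)


def total_loss(ce, cl, lam):
    """L = L_CE + lambda * L_CL."""
    return ce + lam * cl


# Toy example: a yes/no answer head plus one positive and one negative pair.
ce = cross_entropy([2.0, 0.5], target_index=0)
cl = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss = total_loss(ce, cl, lam=0.5)
```

When the anchor and positive agree and the negative is orthogonal, the contrastive term is close to zero, so the total is dominated by the cross-entropy term; moving the positive away from the anchor increases the contrastive penalty in proportion to $\lambda$.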

Paper Structure

This paper contains 35 sections, 8 equations, 10 figures, 7 tables, 2 algorithms.

Figures (10)

  • Figure 1: Spatial Position Bias influences how LVLMs perceive objects based on their position within an image. The visualization shows vision-token attention weights during decoding for different models on a blank white image, in response to the open-ended prompt: "Please describe the image in detail." (a) shows LLaVA-1.5, whose attention increases along the raster-scan order, as identified by Xing et al. (2024). (b-c) show other models, which display arbitrary attention distributions. (d) shows the calibrated vision-token attention map of LLaVA-1.5 after Dynamic Attention Calibration.
  • Figure 2: DAC performance under different settings of $\lambda$ and $N_{DAC}$. Different lines represent various $\lambda$ values, while the y-axis indicates $N_{DAC}$. DAC is applied to 2 consecutive layers, and the results report POPE accuracy averaged over the sampling settings, using LLaVA-1.5-7B.
  • Figure 3: Vision-token attention weights during decoding for different models on a blank white image, in response to the polling prompt: "Is there a bear in the image?"
  • Figure 4: Vision-token attention weights during decoding for different models on a blank black image, in response to the polling prompt: "Is there a bear in the image?"
  • Figure 5: Vision-token attention weights during decoding for different models on a blank noise image, in response to the polling prompt: "Is there a bear in the image?"
  • ...and 5 more figures