CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin

TL;DR

The paper tackles object hallucination in large vision-language models, starting from the observation that caption queries activate visual attention more strongly than non-caption queries. It proposes Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play pipeline that identifies caption-sensitive attention heads and applies precomputed attention-shift vectors to them during inference to strengthen visual perception. CAI reports state-of-the-art hallucination mitigation on POPE, MME, CHAIR, and MMHal-Bench with only modest latency overhead, and it generalizes across open-source LVLMs. The work also provides head-level analyses and ablations that illuminate the role of specific attention heads and hyperparameters in balancing visual and semantic processing.

Abstract

Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from the visual input, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries than when answering non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigation performance with only minimal additional inference cost.
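The observation in the abstract concerns how much attention mass lands on image tokens. The following is a minimal sketch of one way such a quantity could be measured for a single attention row. The function name `visual_attention_share`, the token layout, and the toy attention values are all illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def visual_attention_share(attn_row: Sequence[float],
                           image_token_idx: Sequence[int]) -> float:
    """Fraction of one attention row's mass that falls on image tokens."""
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive mass")
    return sum(attn_row[i] for i in image_token_idx) / total


# Toy attention rows for one head over 6 tokens (made-up values).
# Assumption: tokens 1-3 are image patches, the rest are text tokens.
image_idx = [1, 2, 3]
caption_row = [0.10, 0.25, 0.30, 0.20, 0.10, 0.05]
non_caption_row = [0.30, 0.10, 0.05, 0.05, 0.20, 0.30]

caption_share = visual_attention_share(caption_row, image_idx)
non_caption_share = visual_attention_share(non_caption_row, image_idx)
print(f"caption query:     {caption_share:.2f}")
print(f"non-caption query: {non_caption_share:.2f}")
```

In a real model, the rows would come from the attention weights of a given layer and head, and the share would be averaged over many samples before comparing query types.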

Paper Structure

This paper contains 33 sections, 18 equations, 14 figures, and 11 tables.

Figures (14)

  • Figure 1: The visualization of attention weights at image patch level across different conversations. LVLM correctly generates the detailed content of the image in response to the caption query, but exhibits hallucination (e.g., "helmet") when answering the non-caption query. CAI refines LVLM's visual attention patterns from insufficient to sufficient, effectively enhancing visual perception capability and mitigating hallucination.
  • Figure 2: A systematic quantitative analysis of visual attention weights, head-wise (a) and layer-wise (b). The comparison shows that the caption query significantly enhances the visual attention of LLaVA-1.5-7b.
  • Figure 3: An overview of the CAI method. Each square in the matrix represents an attention head output; dark green squares indicate refined attention head outputs. CAI consists of three stages: (1) §\ref{search}: a best caption query search algorithm seeks the optimization target query that requires the minimal necessary attention weight shift. (2) §\ref{probe}: the original and modified attention outputs are used to identify caption-sensitive attention heads and to compute attention output shift vectors. (3) §\ref{intervention}: the precomputed attention shift vectors are applied to the top $K$ caption-sensitive attention heads during inference, enhancing their visual attention and activating the model's inherent fine-grained visual perception to mitigate hallucination.
  • Figure 4: Main results of LLaVA-1.5-7b on the MS-COCO CHAIR task. Lower values of $\mathrm{CHAIR}_i$ and $\mathrm{CHAIR}_s$ indicate stronger hallucination mitigation at the instance and sentence levels, respectively. $\mathrm{Max\_new\_tokens}$ is set to 64.
  • Figure 5: Accuracy of the baselines and CAI with different caption queries on the GQA Random POPE task.
  • ...and 9 more figures
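
The probing and intervention stages described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: head outputs are plain lists keyed by a hypothetical `(layer, head)` tuple, the shift vector is taken as the mean difference between caption-query and non-caption-query outputs, heads are ranked by the norm of that shift (an assumed stand-in for the paper's actual selection criterion), and the scaling factor `alpha` is hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple

Head = Tuple[int, int]   # (layer index, head index); hypothetical key format
Vector = List[float]


def mean_shift(caption_outs: List[Vector], plain_outs: List[Vector]) -> Vector:
    """Average of (caption - non-caption) head outputs over paired samples."""
    n = len(caption_outs)
    dim = len(caption_outs[0])
    return [sum(c[j] - p[j] for c, p in zip(caption_outs, plain_outs)) / n
            for j in range(dim)]


def l2_norm(v: Vector) -> float:
    return sqrt(sum(x * x for x in v))


def top_k_heads(shifts: Dict[Head, Vector], k: int) -> List[Head]:
    """Rank heads by shift norm (assumption: a proxy for caption sensitivity)."""
    return sorted(shifts, key=lambda h: l2_norm(shifts[h]), reverse=True)[:k]


def intervene(head_outs: Dict[Head, Vector], shifts: Dict[Head, Vector],
              selected: List[Head], alpha: float = 1.0) -> Dict[Head, Vector]:
    """Add alpha * shift to the selected heads; leave other heads unchanged."""
    chosen = set(selected)
    return {h: ([x + alpha * s for x, s in zip(v, shifts[h])]
                if h in chosen else list(v))
            for h, v in head_outs.items()}


# Toy head outputs for two samples and two heads (made-up values).
caption = {(0, 0): [[1.0, 2.0], [3.0, 2.0]], (0, 1): [[0.5, 0.5], [0.5, 0.5]]}
plain = {(0, 0): [[0.0, 0.0], [0.0, 0.0]], (0, 1): [[0.5, 0.0], [0.5, 0.0]]}

# Precompute one shift vector per head, then pick the top-K heads (K = 1).
shifts = {h: mean_shift(caption[h], plain[h]) for h in caption}
selected = top_k_heads(shifts, k=1)

# At inference time, shift only the selected heads' outputs.
steered = intervene({(0, 0): [1.0, 1.0], (0, 1): [1.0, 1.0]},
                    shifts, selected, alpha=0.5)
print(selected, steered)
```

Because the shift vectors are computed once ahead of time, the inference-time step reduces to a vector addition on a small number of heads, which is consistent with the modest overhead the summary describes.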