AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

Li'an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Jane Wang, Xiangui Kang

TL;DR

The authors propose Increasing Attention to Generated Text (IAT) and show that it significantly reduces the hallucination rate while avoiding repetitive descriptions, achieving an attractive trade-off.

Abstract

Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT), which employs a layer-wise threshold to control intervention time and a fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
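
The core idea in the abstract, amplifying the attention that the current token pays to already generated text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `gen_start` index, the multiplicative `(1 + alpha)` rule, and the renormalization step are all assumptions made for illustration, and the abstract does not specify the exact update formula or how the layer-wise threshold is computed. Passing one `alpha` per head only hints at the per-head magnitude described for AdaIAT.

```python
import numpy as np

def amplify_generated_text_attention(attn, gen_start, alpha):
    """Boost attention on generated-text positions, then renormalize.

    attn:      (num_heads, seq_len) post-softmax attention weights of the
               current query token (hypothetical layout).
    gen_start: index of the first generated text token (assumed to be the
               tail of the sequence).
    alpha:     scalar or (num_heads,) amplification factors; assumed here,
               not taken from the paper.
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Broadcast a scalar alpha to one factor per head.
    alpha = np.broadcast_to(np.asarray(alpha, dtype=np.float64), (attn.shape[0],))
    scaled = attn.copy()
    # Scale only the generated-text positions, head by head.
    scaled[:, gen_start:] *= (1.0 + alpha)[:, None]
    # Renormalize so that each head's weights sum to 1 again.
    return scaled / scaled.sum(axis=1, keepdims=True)

attn = np.array([[0.5, 0.3, 0.2],
                 [0.25, 0.25, 0.5]])
out = amplify_generated_text_attention(attn, gen_start=2, alpha=[1.0, 0.0])
```

In this toy call the first head doubles its weight on the single generated token before renormalizing, while the second head (with `alpha = 0`) is left unchanged, which is the kind of head-specific behaviour an adaptive scheme would need.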

Paper Structure

This paper contains 18 sections, 13 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Case illustration: The attention intervention methods amplify the attention to image tokens, thereby emphasizing visual information and mitigating hallucinations. However, the relatively low attention to the generated text causes the model to forget preceding utterances, leading to repetitive descriptions of the prominent object 'clock tower'.
  • Figure 2: How PAI and AdaIAT mitigate hallucination relative to greedy decoding, with repeated descriptions annotated in red and hallucinations annotated in blue. Greedy decoding generates the hallucinated object “cars”. PAI mitigates the hallucination by enhancing attention to image tokens with a fixed $\alpha$, but it suffers from repeated subjects and monotonous, redundant language. In contrast, AdaIAT employs layer-wise thresholds to control the amplification and designs $\mathcal{M}^{(l,h)}$ for each attention head to adaptively enhance attention towards the generated text tokens, which produces accurate and hallucination-free captions.
  • Figure 3: Visualization of the average per-token attention weights from text token $t_{n+1}$ to generated text tokens $T_{p}$ ($\bar{\mathbf{A}}_{T_{p}}^{r}$ and $\bar{\mathbf{A}}_{T_{p}}^{h}$) and to image tokens $V$ ($\bar{\mathbf{A}}_{V}^{r}$ and $\bar{\mathbf{A}}_{V}^{h}$), showing only layers 5–18 for clearer observation.
  • Figure 4: Trends of textual diversity $D_1$ for different methods as the hallucination rate $C_S$ decreases. The dashed line denotes the $D_1$ of the original greedy decoding as a reference. Regions closer to the top-left indicate better performance.
  • Figure 5: The distributions of textual diversity $D_1$ for captions generated by different methods using LLaVA-1.5-7B, where a higher $D_1$ corresponds to greater text diversity. The distribution is partially truncated for better visualization.
  • ...and 2 more figures
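
The captions for Figures 4 and 5 use a textual diversity score $D_1$ without defining it on this page. Assuming it is the common distinct-1 metric (unique unigrams divided by total unigrams), which the excerpt does not confirm, a minimal sketch would be the following. The function name and the whitespace tokenization are illustrative choices, not the paper's evaluation code.

```python
def distinct_1(text):
    """Ratio of unique unigrams to total unigrams (assumed meaning of D_1)."""
    # Lowercased whitespace tokenization is a simplification for illustration.
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

repetitive = distinct_1("a clock tower a clock tower")
varied = distinct_1("the tall clock tower")
```

Under this assumed definition, a caption that keeps repeating "clock tower" scores lower than one that uses each word once, which matches the captions' statement that a higher $D_1$ means greater diversity.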