TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Chunzhao Xie, Tongxuan Liu, Lei Jiang, Yuting Zeng, Jinrong Guo, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu

TL;DR

This work tackles hallucination in large vision-language models by revealing that attention to image tokens decays during generation, which correlates with hallucinations. It introduces TARAC, a training-free method that maintains a temporally accumulated image-token attention and injects it into the current decoding step, thereby reinforcing visual grounding. Across multiple LVLMs and benchmarks (e.g., CHAIR, AMBER, SHR), TARAC substantially reduces hallucinations while preserving language quality and incurring only modest inference overhead. The results demonstrate that a simple, inference-time attention manipulation can meaningfully improve the reliability of LVLMs in both generative and discriminative tasks, with practical implications for deployment in real-world applications.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various tasks; however, the challenge of hallucinations constrains their practical applications. The hallucination problem arises from multiple factors, including the inherent hallucinations in language models, the limitations of visual encoders in perception, and biases introduced by multimodal data. Extensive research has explored ways to mitigate hallucinations. For instance, OPERA prevents the model from overly focusing on "anchor tokens", thereby reducing hallucinations, whereas VCD mitigates hallucinations by employing a contrastive decoding approach. In this paper, we investigate the correlation between the decay of attention to image tokens and the occurrence of hallucinations. Based on this finding, we propose Temporal Attention Real-time Accumulative Connection (TARAC), a novel training-free method that dynamically accumulates and updates LVLMs' attention on image tokens during generation. By enhancing the model's attention to image tokens, TARAC mitigates hallucinations caused by the decay of attention on image tokens. We validate the effectiveness of TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations. In particular, TARAC reduces $C_S$ by 25.2 and $C_I$ by 8.7 compared to VCD on the CHAIR benchmark.

Paper Structure

This paper contains 30 sections, 6 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Case Analysis. LLaVA’s hallucinated responses and the correct responses with TARAC, along with the corresponding visual attention, are presented above. When visual attention is low, the model correspondingly generates hallucinated sentences. After applying TARAC, visual attention significantly increases, and the corresponding hallucinations no longer occur.
  • Figure 2: Architecture of TARAC. TARAC is applied to the Attention module of the Transformer in the LLM and proceeds in three steps during the generation of each token (time step): first, it captures the attention on image tokens and updates the accumulated attention; then, it injects the accumulated attention into the attention of the token currently being generated; finally, it renormalizes the attention weights. The LLM in the figure is unfolded along the temporal dimension, with $t$ representing time steps for clarity.
  • Figure 3: (a) Effect of different parameters under CHAIR evaluation, with $10\beta + \alpha$ mapped to the x-axis for clarity. (b) Selected best parameters: $\alpha=0.9, \beta=0.3$.
  • Figure 4: Comparison of image token attention w/ and w/o TARAC, showing overall enhanced visual attention and a stronger attention sink effect.
  • Figure 5: Inference efficiency comparison between TARAC and other methods. Time Per Output Token (TPOT) is reported as a multiple of the baseline. GPU memory cost is the increase in peak usage (MB) relative to the baseline, averaged over 10 runs.
  • ...and 4 more figures
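
The three steps named in the Figure 2 caption (capture and accumulate, inject, renormalize) can be illustrated with a minimal sketch. This summary does not give the paper's update or injection formulas, so the forms below are assumptions: a decay-weighted running sum controlled by `alpha`, and an additive injection scaled by `beta`. The function name `tarac_step`, its arguments, and the toy attention rows are hypothetical and chosen only for illustration. The defaults reuse the values reported in Figure 3(b).

```python
from typing import List, Sequence, Tuple


def tarac_step(
    attn: Sequence[float],
    image_idx: Sequence[int],
    accumulated: Sequence[float],
    alpha: float = 0.9,
    beta: float = 0.3,
) -> Tuple[List[float], List[float]]:
    """Process one attention row (one head, one layer) at one decoding step.

    attn:        post-softmax attention of the current token over all positions
    image_idx:   positions that hold image tokens
    accumulated: running image-token attention carried over from earlier steps
    """
    # Step 1: capture the current image-token attention and update the accumulator.
    # ASSUMPTION: decay-weighted running sum; the paper's exact rule may differ.
    current = [attn[i] for i in image_idx]
    new_acc = [alpha * a + c for a, c in zip(accumulated, current)]

    # Step 2: inject the accumulated attention into the current row.
    # ASSUMPTION: additive injection scaled by beta.
    boosted = list(attn)
    for j, i in enumerate(image_idx):
        boosted[i] += beta * new_acc[j]

    # Step 3: renormalize so the row sums to 1 again.
    total = sum(boosted)
    renormalized = [w / total for w in boosted]
    return renormalized, new_acc


if __name__ == "__main__":
    image_idx = [0, 1]   # toy layout: two image tokens, then text tokens
    acc = [0.0, 0.0]     # accumulator starts empty
    rows = [
        [0.30, 0.30, 0.40],        # early step: image attention is still high
        [0.10, 0.10, 0.30, 0.50],  # later step: image attention has decayed
    ]
    for row in rows:
        before = sum(row[i] for i in image_idx)
        out, acc = tarac_step(row, image_idx, acc)
        after = sum(out[i] for i in image_idx)
        print(f"image attention share: {before:.3f} -> {after:.3f}")
```

In this toy setup the share of attention on image tokens rises at both steps, and the later step benefits from attention accumulated earlier, which is the behavior the caption describes. A real integration would apply the same logic inside the model's attention module for the selected layers and heads, which this sketch does not attempt.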