TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction

Chao Wang, Weiwei Fu, Yang Zhou

TL;DR

This work addresses object and text hallucination in vision-language models by identifying a logits-level property: logits at adjacent timesteps are more similar in distribution than logits at distant timesteps. It introduces Cross-Temporal Prediction Connection (TPC), a training-free, plug-and-play approach that connects logits across time via Linear Temporal Prediction Connection (LTPC) or Attenuation Temporal Prediction Connection (ATPC) to improve semantic consistency. The method consistently reduces hallucinations and improves open-ended generation across datasets (e.g., POPE, MMHal-Bench, MME) and models (LLaVA, QwenVL, MiniGPT4) while maintaining efficiency, outperforming prior post-hoc techniques such as VCD and DoLa. The results show notable gains in accuracy and F1 on object hallucination tasks, robust performance under hallucination prompts, and favorable qualitative cases, suggesting practical impact for reliable multimodal generation in high-stakes contexts.

Abstract

Vision-language models (VLMs) have achieved remarkable advancements, capitalizing on the impressive capabilities of large language models (LLMs) across diverse tasks. Despite this, a critical challenge known as hallucination occurs when models overconfidently describe objects or attributes absent from the image, a problem exacerbated by the tendency of VLMs to rely on linguistic priors. This limitation reduces model reliability in high-stakes applications. In this work, we observe that the consistency of logits is enhanced by their continuity across timesteps, and we introduce a straightforward and efficient method, Cross-Temporal Prediction Connection (TPC), designed to enhance the semantic consistency of logits by connecting them temporally across timesteps. TPC amplifies information flow and improves coherence, effectively reducing hallucination. Extensive experiments show that TPC surpasses representative existing methods, delivering superior performance in both accuracy and efficiency while maintaining robustness in open-ended text generation tasks.
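
The abstract does not state the connection formula, so the following is only an illustrative sketch of what "connecting logits temporally across timesteps" could look like, not the authors' LTPC or ATPC definition. It assumes the current logits receive a small weighted sum of earlier logits, with weights that either fall off linearly with distance or decay geometrically. The names `connection_weights`, `connect_logits`, `alpha`, and `decay` are hypothetical.

```python
import numpy as np

def connection_weights(n_prev, mode="linear", decay=0.5):
    """Weights over previous steps, ordered from nearest (distance 1) to farthest.

    ASSUMPTION: both weighting schemes are illustrative guesses, not taken
    from the paper.
    """
    distance = np.arange(1, n_prev + 1)
    if mode == "linear":
        # Nearest step gets the largest weight; weights fall off linearly.
        w = (n_prev - distance + 1) / n_prev
    elif mode == "attenuation":
        # Weights shrink geometrically with distance.
        w = decay ** distance
    else:
        raise ValueError(f"unknown mode: {mode}")
    return w / w.sum()  # normalise so the weights sum to 1

def connect_logits(current, history, alpha=0.1, mode="linear", decay=0.5):
    """Add a weighted sum of earlier logits to the current logits.

    current: array of shape (vocab,)
    history: list of earlier logits, oldest first, each of shape (vocab,)
    alpha:   hypothetical connection strength
    """
    if len(history) == 0:
        return current
    prev = np.stack(history[::-1])  # nearest previous step first
    w = connection_weights(len(prev), mode, decay)
    return current + alpha * (w[:, None] * prev).sum(axis=0)

# Usage: three earlier steps over a toy vocabulary of 5 tokens.
rng = np.random.default_rng(0)
history = [rng.normal(size=5) for _ in range(3)]
current = rng.normal(size=5)
connected = connect_logits(current, history, alpha=0.1, mode="attenuation")
next_token = int(np.argmax(connected))
```

Because the adjustment only touches logits at decoding time, a scheme of this shape needs no retraining, which matches the training-free, plug-and-play framing above.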

Paper Structure

This paper contains 27 sections, 9 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: (a) VCD contrasts the normal logits with those obtained from inputting a noise-masked image. (b) DoLa compares each token's logits with the early-exit logits from other layers, selecting the logits with the largest Jensen-Shannon divergence for contrastive decoding. (c) Our TPC connects the logits of adjacent time steps without the need to contrast them with other logits.
  • Figure 2: The relationship between the JS divergence of logits and the distance between them. The $x$-axis represents the distance between the time steps of the two logits being compared. Each point is obtained by computing the JS divergence between pairs of logits from LLaVA output sequences and averaging the results.
  • Figure 3: A sliding window is used to validate the two different connection strategies. The $x$-axis represents the position of the segmented windows in the sequence. The further to the right, the closer the window is to the logit at the current time step. The upper part of the figure shows the accuracy of the two strategies, while the lower part displays the corresponding F1 score. Regular indicates the scores from nucleus sampling.
  • Figure 4: An illustration and overview of TPC. (a) Sequentially connecting logits across different time steps enhances and propagates temporal information. (b) Dimensionality reduction maps the logits of hallucinated tokens, regular tokens, and TPC tokens into a two-dimensional space, where TPC displays a more dispersed distribution, separated from the hallucinated tokens. (c) ATPC and LTPC.
  • Figure 5: PCA analysis maps the model’s output logits into 3D space, showing a clear separation between TPC and Regular distributions.
  • ...and 3 more figures
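
The measurement described in the Figure 2 caption, JS divergence between logits as a function of time-step distance, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: logits are turned into distributions with a softmax, pairs are taken within a single sequence, and the helper names (`softmax`, `js_divergence`, `mean_js_by_distance`) are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) along the last axis."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def mean_js_by_distance(logits, max_distance):
    """Average JS divergence over all step pairs (t, t + d) for each distance d.

    logits: array of shape (timesteps, vocab) for one generated sequence.
    Returns a dict mapping distance d to the mean divergence.
    """
    probs = softmax(logits)
    n_steps = probs.shape[0]
    result = {}
    for d in range(1, min(max_distance, n_steps - 1) + 1):
        result[d] = float(js_divergence(probs[:-d], probs[d:]).mean())
    return result

# Usage with random stand-in logits (real logits would come from the model).
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(6, 10))
curve = mean_js_by_distance(toy_logits, max_distance=3)
```

Averaging the per-sequence curves over many output sequences would give one point per distance, as in the figure. Random stand-in logits carry no temporal structure, so the toy `curve` is not expected to reproduce the trend reported there.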