ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie

TL;DR

This work tackles the persistent hallucination problem in large vision-language models (LVLMs) by introducing ONLY, a training-free decoding method that adds a single textual-enhanced multi-head attention (TE-MHA) layer and uses a text-to-visual entropy ratio (TVER) to bias attention toward textual information. An adaptive decoding scheme then fuses the textual-enhanced logits with the original predictions, switching between collaborative and contrastive modes based on a token-wise distribution distance. Across three LVLM backbones and multiple benchmarks, ONLY outperforms prior state-of-the-art methods with minimal computational overhead, which makes it practical for real-time deployment. Ablation studies validate the importance of TVER-based head selection, layer placement, and hyperparameter choices, and scalability tests show that the benefits extend to larger models such as the 13B variant of LLaVA-1.5.

Abstract

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
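
To make the text-to-visual entropy ratio mentioned in the abstract more concrete, the sketch below computes one plausible version of it for a single attention head and a single generated token. It is an illustration under stated assumptions, not the authors' implementation: the function names, the renormalization of attention within each modality, and the `eps` stabilizer are all hypothetical choices that the text above does not specify.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats); zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def renormalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("attention mass for this modality is zero")
    return [w / total for w in weights]


def text_to_visual_entropy_ratio(attn_row, is_visual, eps=1e-8):
    """Ratio of textual to visual attention entropy for one head and one token.

    attn_row  : attention weights of the current token over all key positions
    is_visual : parallel list of booleans, True where the key is an image token
    Assumption: each modality's weights are renormalized before taking entropy.
    """
    text = [w for w, v in zip(attn_row, is_visual) if not v]
    visual = [w for w, v in zip(attn_row, is_visual) if v]
    h_text = entropy(renormalize(text))
    h_visual = entropy(renormalize(visual))
    return h_text / (h_visual + eps)


# Example: attention is sharply peaked on one image token but spread evenly
# over the text tokens, so the ratio comes out above 1.
ratio = text_to_visual_entropy_ratio(
    [0.45, 0.05, 0.25, 0.25], [True, True, False, False]
)
```

In this sketch a ratio above 1 means the head's attention is more diffuse over text tokens than over image tokens. How such ratios are turned into a head-selection rule is not stated in the text above, so that step is left out.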

Paper Structure

This paper contains 36 sections, 19 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of accuracy and inference speed across multiple hallucination mitigation approaches. Bubble size indicates GPU memory consumption. Our method effectively mitigates hallucination with only 0.07$\times$ extra time.
  • Figure 2: Overview of our proposed ONLY. Our method retains the core decoding process of LVLMs but incorporates a textual-enhanced multi-head attention layer with a residual connection to the last layer's output. This adjustment aims to produce an output with a greater focus on textual information. The resulting textual-enhanced logits are then adaptively decoded alongside the original output, employing either contrastive or collaborative decoding strategies to optimize performance.
  • Figure 3: Impact of applying diffusion noise on textual and visual attention entropy. We perform an analysis on all COCO samples from the POPE benchmark and observe that as distortion increases, textual entropy rises whereas visual entropy decreases.
  • Figure 4: Text-to-visual entropy ratio is correlated with hallucinations. (Left) Density plot of the token-wise average text-to-visual entropy ratio and bar plot of the average $\text{CHAIR}_I$ in each bin on the CHAIR benchmark; (Right) Density plots of the token-level Manhattan distance between original and textual-enhanced logits for both hallucinatory and non-hallucinatory tokens on POPE.
  • Figure 5: Results on MMVP [tong2024eyes]. We apply our approach to LLaVA-1.5 [liu2024improved] and compare its performance against other hallucination mitigation methods.
  • ...and 3 more figures
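
The adaptive decoding step described in the Figure 2 and Figure 4 captions can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's algorithm: the threshold `gamma`, the weights `alpha_collab` and `alpha_contrast`, the two fusion formulas, and the choice to measure the Manhattan distance on softmax probabilities are all hypothetical. The direction of the switch (contrastive when the two outputs disagree strongly) is likewise assumed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def manhattan(p, q):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def adaptive_fuse(orig_logits, te_logits, gamma=0.2,
                  alpha_collab=1.0, alpha_contrast=1.0):
    """Fuse original and textual-enhanced logits for one decoding step.

    Assumed rule: if the two next-token distributions are close, add them
    (collaborative); otherwise subtract the textual-enhanced logits from an
    amplified copy of the original logits (contrastive).
    """
    d = manhattan(softmax(orig_logits), softmax(te_logits))
    if d <= gamma:
        mode = "collaborative"
        fused = [o + alpha_collab * t for o, t in zip(orig_logits, te_logits)]
    else:
        mode = "contrastive"
        fused = [(1.0 + alpha_contrast) * o - alpha_contrast * t
                 for o, t in zip(orig_logits, te_logits)]
    return fused, mode, d


# Identical outputs agree, so they are combined collaboratively.
fused_same, mode_same, _ = adaptive_fuse([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
# Reversed preferences disagree strongly, so the contrastive branch is used.
fused_diff, mode_diff, _ = adaptive_fuse([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

Because both sets of logits come from the same forward pass plus one extra layer, a rule of this shape needs no second query to the model, which matches the single-query design the abstract emphasizes.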