VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Jiaqi Wang, Yifei Gao, Jitao Sang

TL;DR

This work tackles LVLM hallucinations by shifting focus from language priors to the visual encoding stage. It introduces VaLiD, a visual-layer fusion contrastive decoding method that uses uncertainty from early visual layers to construct a reference distribution and correct distorted visual information during generation. Through entropy-guided layer selection, bucketing, and an adaptive reliability constraint, VaLiD achieves state-of-the-art results on POPE, AMBER, and MME benchmarks across several LVLMs, while remaining compatible with other decoding strategies. The approach offers a practical, visual-centric path to improving the reliability of multimodal reasoning in LVLMs, with clear limitations and directions for future work.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods that mitigate hallucinations by adjusting the decoding strategy at inference time, typically attributing hallucinations to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model's reasoning capabilities. Specifically, key features retained in earlier visual layers may become gradually distorted as the information propagates toward the output layer. Building on these insights, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method uses uncertainty to guide visual layer selection, correcting distortions in the visual encoding process and thereby enhancing the reliability of the generated content. Experimental results demonstrate the effectiveness of VaLiD in mitigating hallucinations across various benchmarks, achieving state-of-the-art performance compared to baseline methods. Code is available at https://github.com/RicardoLuL/VaLiD_LVLMs_hallucinations (GitHub).
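
The uncertainty-guided layer selection mentioned in the abstract can be illustrated with a small sketch. It assumes that each visual hidden layer yields its own next-token probability distribution, that uncertainty is measured by Shannon entropy, and that the k highest-entropy layers are the ones selected. The function names, the value of k, and the toy distributions are hypothetical and are not taken from the authors' code.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain_layers(layer_probs, k):
    """Return the indices of the k layers with the highest entropy.

    layer_probs maps a layer index to the token distribution decoded
    from that layer. Assumption: "top-k uncertainty" means the k
    largest entropies.
    """
    ranked = sorted(layer_probs, key=lambda i: entropy(layer_probs[i]), reverse=True)
    return sorted(ranked[:k])


# Toy distributions over a four-token vocabulary (illustrative values only).
toy_layers = {
    4: [0.25, 0.25, 0.25, 0.25],
    8: [0.70, 0.10, 0.10, 0.10],
    12: [0.40, 0.30, 0.20, 0.10],
    16: [0.97, 0.01, 0.01, 0.01],
}
selected = select_uncertain_layers(toy_layers, k=2)
print(selected)
```

In this toy setting the uniform distribution of layer 4 and the flat distribution of layer 12 carry the most entropy, so those two layers are selected.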

Paper Structure

This paper contains 20 sections, 9 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: An example in which the LVLM gives a hallucinated response when using features from the standard visual output layer, yet correctly identifies the number of people in the image when relying on features from other visual hidden layers. The correct answer and the hallucination are highlighted in blue and red, respectively.
  • Figure 2: Layer uncertainty and decoding accuracy for all visual hidden layers. Each node on the blue curve represents the average entropy across the AMBER benchmark. Note that LLaVA-v1.5 uses the features from the penultimate layer of CLIP-ViT as its standard visual output.
  • Figure 3: Overview of the VaLiD decoding process. At each time step, the LVLM auto-regressively samples the token $y_{t}$ based on the visual input, the text query, and the previously generated tokens. The probability distributions shown in green and blue correspond to decoding results from the standard visual output layer and from early visual layers, respectively. The red boxes mark the selected visual layers, those with the top-k uncertainty, whose probability distributions are used to calculate the reference distribution. The final corrected probability distribution, shown in red, is obtained through contrastive decoding. In this case, when asked about the number of people in the image, the LVLM generates the correct decoding result, "three", instead of the incorrect "four" from the original distribution.
  • Figure 4: Results on the MME benchmark. LVLMs with VaLiD achieve the best performance in 11 out of 14 categories for LLaVA-v1.5, 13 out of 14 categories for InstructBLIP, and 12 out of 14 categories for Qwen-VL. VaLiD not only mitigates hallucinations but also enhances the overall capabilities of LVLMs. Detailed results are provided in the supplementary material.
  • Figure 5: Illustration of hallucination correction by our proposed VaLiD, with two examples from the AMBER dataset. Hallucinated objects from vanilla decoding are highlighted in red.
  • ...and 1 more figures
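
The decoding step summarized in the Figure 3 caption can be sketched roughly as follows. This is a sketch under several assumptions rather than the paper's exact formulation: the reference distribution is taken to be the plain average of the selected layers' distributions, the contrast follows the form commonly used in contrastive decoding, $(1+\alpha)\log p_{\text{std}} - \alpha \log p_{\text{ref}}$, and the reliability constraint is modeled as a cutoff that keeps only tokens with $p_{\text{std}} \ge \beta \max p_{\text{std}}$. The names `fuse_reference` and `contrastive_decode`, the parameters `alpha` and `beta`, and all numbers are hypothetical.

```python
import math


def fuse_reference(dists):
    """Average several token distributions into one reference distribution.

    Assumption: plain averaging; the paper's fusion rule may differ.
    """
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]


def contrastive_decode(p_std, p_ref, alpha=1.0, beta=0.1):
    """Contrast the standard distribution against a reference distribution.

    Tokens whose standard probability falls below beta * max(p_std) are
    masked out (assumed form of the reliability constraint). The remaining
    scores are renormalized with a softmax.
    """
    cutoff = beta * max(p_std)
    scores = []
    for ps, pr in zip(p_std, p_ref):
        if ps < cutoff:
            scores.append(float("-inf"))
        else:
            # Guard against log(0) in the reference distribution.
            scores.append((1 + alpha) * math.log(ps) - alpha * math.log(max(pr, 1e-12)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy distributions over a four-token vocabulary (illustrative values only).
p_std = [0.45, 0.40, 0.13, 0.02]
p_ref = fuse_reference([[0.60, 0.20, 0.15, 0.05], [0.40, 0.30, 0.20, 0.10]])
corrected = contrastive_decode(p_std, p_ref)
best = corrected.index(max(corrected))
print(best, corrected)
```

With these toy numbers the reference distribution favors the first token more strongly than the standard distribution does, so the contrast shifts the top prediction to the second token, while the last token is removed by the cutoff.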

Theorems & Definitions (2)

  • Claim 1
  • Proof 1