VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Jiaqi Wang; Yifei Gao; Jitao Sang

TL;DR

This work tackles LVLM hallucinations by shifting focus from language priors to the visual encoding stage. It introduces VaLiD, a visual-layer fusion contrastive decoding method that uses uncertainty from early visual layers to construct a reference distribution and correct distorted visual information during generation. Through entropy-guided layer selection, bucketing, and an adaptive reliability constraint, VaLiD achieves state-of-the-art results on POPE, AMBER, and MME benchmarks across several LVLMs, while remaining compatible with other decoding strategies. The approach offers a practical, visual-centric path to improving the reliability of multimodal reasoning in LVLMs, with clear limitations and directions for future work.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage, typically attributing hallucinations to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model's reasoning capabilities. Specifically, earlier visual layers may retain key features, but these features gradually become distorted as the information propagates toward the output layer. Building on these insights, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method uses uncertainty to guide visual layer selection, correcting distortions in the visual encoding process and thereby enhancing the reliability of the generated content. Experimental results demonstrate the effectiveness of VaLiD in mitigating hallucinations across various benchmarks, achieving state-of-the-art performance compared with baseline methods. Code is available on GitHub at https://github.com/RicardoLuL/VaLiD_LVLMs_hallucinations.
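To make the idea of uncertainty-guided contrastive decoding more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each candidate visual layer yields its own next-token logits, that the reference is the candidate with the highest predictive entropy, and that the contrast follows the generic contrastive-decoding form with a plausibility cutoff. The function names, the `alpha` and `beta` parameters, and the direction of the contrast are all hypothetical; layer fusion and bucketing are omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_reference(candidate_logits):
    # Assumption: pick the candidate layer whose next-token distribution
    # is most uncertain (highest entropy). Returns its index.
    entropies = [entropy(softmax(l)) for l in candidate_logits]
    return max(range(len(entropies)), key=entropies.__getitem__)

def contrastive_scores(final_logits, ref_logits, alpha=1.0, beta=0.1):
    # Assumption: generic contrastive decoding,
    #   score = (1 + alpha) * log p_final - alpha * log p_ref,
    # restricted to tokens with p_final >= beta * max(p_final).
    p_final = softmax(final_logits)
    p_ref = softmax(ref_logits)
    cutoff = beta * max(p_final)
    scores = []
    for pf, pr in zip(p_final, p_ref):
        if pf < cutoff:
            scores.append(float("-inf"))  # implausible token is masked out
        else:
            log_ref = math.log(max(pr, 1e-12))  # guard against underflow
            scores.append((1 + alpha) * math.log(pf) - alpha * log_ref)
    return scores

# Toy usage with a three-token vocabulary (made-up numbers).
final_logits = [2.0, 1.9, -3.0]
candidates = [[2.0, 0.5, 0.0], [0.1, 0.0, 0.0]]
ref_index = select_reference(candidates)
scores = contrastive_scores(final_logits, candidates[ref_index])
next_token = max(range(len(scores)), key=scores.__getitem__)
```

The cutoff keeps the contrast from promoting tokens that the final-layer distribution already considers implausible, which is the role a reliability constraint plays in this family of decoding methods.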
