Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

Jiaqi Bai, Hongcheng Guo, Zhongyuan Peng, Jian Yang, Zhoujun Li, Mohan Li, Zhihong Tian

TL;DR

This paper tackles object hallucination in large vision-language models (LVLMs), which the authors attribute to overconfidence in irrelevant visual features when soft visual tokens are mapped into the LLM's word embedding space. It proposes AdaVIB, an adaptive Variational Information Bottleneck that injects stochastic noise into the soft visual tokens and uses an entropy-based schedule to adapt the noise level per sample, compressing irrelevant information while preserving predictive content. The method is lightweight and trains only the vision-language projector. Applied to MiniGPT-4 and LLaVa-1.5, it yields consistent reductions in CHAIR_S and CHAIR_I on MSCOCO and improves POPE metrics, outperforming strong baselines; ablations demonstrate the importance of the adaptive $\beta$ and of reparameterization. The work improves the reliability of LVLMs by narrowing the modality gap between visual patterns and language.

Abstract

Large vision-language models show tremendous potential in understanding visual information through human languages. However, they are prone to object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens are mapped to the LLM's word embedding space. Specifically, by measuring the semantic similarity between visual tokens and the LLM's word embeddings, we observe that the smoothness of the similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, which helps constrain irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy that adapts the injected noise to the smoothness of the similarity distribution. We apply the proposed AdaVIB to distinct model architectures. Experimental results demonstrate that AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.
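
The noise injection described in the abstract can be illustrated with a minimal sketch of a generic variational information bottleneck, assuming a diagonal Gaussian over each soft visual token and a standard normal prior. This is not the authors' implementation: the function names are hypothetical, and the rule in `adaptive_beta` (scaling a maximum weight linearly by a normalized entropy in [0, 1]) is an assumption made only for illustration, since the abstract does not give the exact form of the entropy-based strategy.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    Writing the sample this way keeps it a deterministic function of
    (mu, log_var) plus external noise, which is the reparameterization trick.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def adaptive_beta(beta_max, normalized_entropy):
    """ASSUMPTION (illustrative only): scale the KL weight linearly with a
    normalized entropy clipped to [0, 1]. The source does not specify this rule.
    """
    h = min(max(normalized_entropy, 0.0), 1.0)
    return beta_max * h


def vib_loss(task_loss, mu, log_var, beta):
    """Generic VIB objective: task loss plus a beta-weighted KL penalty."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.1, 0.8], [-2.0, -2.0, -2.0]
    z = reparameterize(mu, log_var, rng)  # noisy soft visual token
    beta = adaptive_beta(beta_max=0.1, normalized_entropy=0.7)
    print(z, vib_loss(task_loss=1.25, mu=mu, log_var=log_var, beta=beta))
```

In this sketch a larger `beta` pushes the token distribution toward the prior and so compresses more; whether the paper increases or decreases the weight for smoother similarity distributions should be checked against its equations.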

Paper Structure

This paper contains 27 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The smoothness of the similarity distribution correlates with the emergence of object hallucinations. We use the normalized dot product to measure the semantic similarity between soft visual tokens and the LLM's word embeddings. The y-axis of the similarity distribution denotes the similarity score. The x-axis lists the top-ranked LLM tokens, sorted in descending order of similarity.
  • Figure 2: The model architecture of AdaVIB. AdaVIB compresses the input representations $\mathbf{v}$ into soft visual tokens $\mathbf{z}$ with mean $\mu_{\theta}(\mathbf{v})$ and constrains irrelevant information by injecting Gaussian noise with variance $\Sigma_{\theta}(\mathbf{v})$.
  • Figure 3: Distribution of the maximum similarity score between soft visual tokens and the LLM's word embeddings. We use MiniGPT-4 as the model backbone. The x-axis denotes the normalized similarity score, ranging from 0 to 1. The y-axis denotes the proportion of hallucinated samples in a given range relative to all hallucinated samples.
  • Figure 4: Correlation between the KL loss (Equation \ref{kl_loss}) and the similarity entropy (Equation \ref{sim_entropy}) over the course of training. All curves are smoothed with an exponential moving average to make the trend easier to see.
  • Figure 5: Impact of different $\beta$ values on object hallucinations. Figure \ref{minigpt4_impact} and Figure \ref{llava_impact} present the results with MiniGPT-4 and LLaVa-1.5 as the backbone, respectively.
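
The similarity analysis behind Figures 1, 3, and 4 can be sketched as follows, reading "normalized dot product" as cosine similarity. The helper names are hypothetical, and turning the top-k scores into a probability distribution with a softmax before taking the entropy is an assumption made for illustration; the paper's own definition of the similarity entropy may differ.

```python
import math


def cosine_similarity(u, v):
    """Normalized dot product between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similarities(visual_token, word_embeddings, k):
    """Similarity of one soft visual token to every word embedding,
    sorted in descending order and truncated to the top k."""
    scores = [cosine_similarity(visual_token, w) for w in word_embeddings]
    return sorted(scores, reverse=True)[:k]


def normalized_entropy(scores):
    """ASSUMPTION: softmax the scores, then divide the Shannon entropy by
    log(len(scores)) so that a perfectly flat distribution gives 1.0."""
    if len(scores) < 2:
        return 0.0
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(scores))


if __name__ == "__main__":
    # Toy 2-d "vocabulary" of three word embeddings, for illustration only.
    vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    scores = top_k_similarities([1.0, 0.0], vocab, k=2)
    print(scores, normalized_entropy(scores))
```

Under this reading, a flat (smooth) similarity distribution gives an entropy near 1, while a distribution dominated by a single token gives a value near 0.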