On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun

TL;DR

The work identifies epistemic uncertainty in the vision encoder (VE) as a major contributor to object hallucination in LVLMs. Using PGD-based adversarial perturbations, it derives an uncertainty mask that highlights unreliable visual tokens, then proposes a training-free mitigation that masks these tokens during self-attention in the intermediate VE layers. Empirical results across multiple LVLMs and benchmarks show reduced hallucination while preserving caption quality, and the method remains compatible with existing decoding- and attention-based defenses. The approach emphasizes VE-focused improvements for reliability and suggests broad applicability, albeit with some limitations for architectures such as the Q-Former-based MiniGPT-4.

Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination, that is, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method that uses adversarial perturbations to identify uncertain visual tokens efficiently, and a method that masks these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and works synergistically with prior methods.
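To make the adversarial proxy concrete, here is a minimal sketch of a PGD loop that maximizes the mean squared error between original and perturbed features under an $\ell_\infty$ budget $k$. It is not the authors' implementation. The one-layer `tanh` encoder, the helper names (`encode`, `pgd_deviation`), the step size, the iteration count, and the random start are all illustrative assumptions. Each output unit stands in for a single visual token.

```python
import math
import random


def encode(W, x):
    """Toy stand-in for an early VE layer: one scalar feature per 'token'."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]


def sign(g):
    """Return -1, 0 or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)


def pgd_deviation(W, x, k=0.03, step=0.01, iters=20, seed=0):
    """Maximise MSE(encode(x + delta), encode(x)) subject to ||delta||_inf <= k.

    All hyperparameter values here are hypothetical, chosen for the toy only.
    """
    rng = random.Random(seed)
    f_orig = encode(W, x)
    # Random start: at delta = 0 the gradient of the MSE is exactly zero.
    delta = [rng.uniform(-k, k) for _ in x]
    for _ in range(iters):
        f_attk = encode(W, [v + d for v, d in zip(x, delta)])
        # Analytic gradient of the MSE w.r.t. each unit's pre-activation.
        g_pre = [2.0 * (a - o) * (1.0 - a * a) / len(f_attk)
                 for a, o in zip(f_attk, f_orig)]
        # Back-propagate through W to get the gradient w.r.t. delta.
        grad = [sum(W[r][c] * g_pre[r] for r in range(len(W)))
                for c in range(len(x))]
        # Signed ascent step, then projection onto the L-infinity ball.
        delta = [max(-k, min(k, d + step * sign(g)))
                 for d, g in zip(delta, grad)]
    f_attk = encode(W, [v + d for v, d in zip(x, delta)])
    # Per-token feature deviation, used as the uncertainty proxy.
    deviation = [abs(a - o) for a, o in zip(f_attk, f_orig)]
    return delta, deviation


rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
x = [rng.uniform(0.0, 1.0) for _ in range(6)]
delta, deviation = pgd_deviation(W, x)
```

Tokens with larger entries in `deviation` would be the candidates for masking. In a real vision encoder the gradient would come from automatic differentiation rather than the hand-derived expression used here.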

Paper Structure

This paper contains 70 sections, 4 theorems, 18 equations, 32 figures, 17 tables.

Key Result

Lemma 3.1

Let $f = \{f_t\}_{t=1}^L$ be a smooth $L$-layer neural network parameterized by $\theta$. For an input $x \in \mathbb{R}^{N \times 3}$, define the hidden state at layer $t$ as $z^{(t)} = f_t \circ \cdots \circ f_1(x)$. For a perturbed input $x+\epsilon$, with $\|\epsilon\|_\infty \le k$ for sufficie

Figures (32)

  • Figure 1: Overall illustration of the adversarial attack and uncertainty mask generation process. (a) The original image is processed by the vision encoder (VE) to obtain features $f_{\text{orig}}$. An adversarial image is created by adding optimizable noise, which is then encoded to produce $f_{\text{attk}}$. The noise is optimized using Projected Gradient Descent (PGD) to maximize the mean squared error between $f_{\text{orig}}$ and $f_{\text{attk}}$, as described in Eq. \ref{eq:pgd}. (b) From layers $i$ to $j-1$, we extract feature sets $\mathcal{F}_{\text{orig}}=\{f_{\text{orig}}^{i},\dots,f_{\text{orig}}^{j-1}\}$ and $\mathcal{F}_{\text{attk}}=\{f_{\text{attk}}^{i},\dots,f_{\text{attk}}^{j-1}\}$. The norm differences of corresponding features form layer-wise uncertainty maps $\mathcal{U}=\{u^{i},\dots,u^{j-1}\}$. These maps are min-max normalized, aggregated, and standardized to produce the final binary uncertainty mask $M$ using a threshold $\sigma_{\text{th}}$.
  • Figure 2: Visual comparison of estimated uncertainty from MC dropout [mukhoti2018evaluating] and our method. Our uncertainty map $U$ identifies uncertain regions similar to those in the uncertainty map obtained via MC dropout. MC dropout was applied to the residuals of the LLaVA-1.5 vision encoder with a dropout rate of $p=0.5$, and the variance of each token was estimated over 1,000 forward passes. For the adversarial attack, we applied 100 iterations of PGD with $k=3$. The MC-based uncertainty values were log-scaled for visualization clarity. The runtime for each example is shown in the top-left corner.
  • Figure 3: Relative deviation between attacked and original features. We used 500 images from MS-COCO [lin2014microsoft] with the LLaVA-1.5 vision encoder [liu2024improved]. Perturbations introduced through the vision encoder remain minimal in early layers but intensify in later ones. We extract the mask from early layers, where feature deviations are comparatively small. Error bars denote the $2\sigma$ range.
  • Figure 4: Relationship between uncertain visual tokens and object hallucination. The $x$-axis represents the average variance within each bin, while the $y$-axis shows the corresponding metric scores. The results indicate that higher uncertainty is associated with more object hallucination, with $p$-value $<0.05$. The trend line was fitted with a quadratic function. Note that higher values of CHAIR$_s$ and CHAIR$_i$, and a lower F1 score, indicate more severe object hallucination.
  • Figure 5: Illustration of our attention masking method during inference. In the intermediate multi-head self-attention layers of the vision encoder, we apply a binary uncertainty mask $M$ to the attention outputs. This token-wise masking reduces the influence of uncertain visual tokens, while preserving the meaningful visual representation.
  • ...and 27 more figures
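
The mask-generation steps in the Figure 1(b) caption and the token-wise masking in the Figure 5 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code. The function names are hypothetical. Averaging is assumed for the "aggregated" step, $M=1$ is assumed to mark an uncertain token, and masking is assumed to zero that token's attention output.

```python
import statistics


def min_max(u):
    """Rescale one layer's per-token deviations to [0, 1]."""
    lo, hi = min(u), max(u)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in u]


def uncertainty_mask(layer_maps, sigma_th=1.0):
    """layer_maps: one list per layer, each holding a deviation norm per token."""
    normed = [min_max(u) for u in layer_maps]
    n_tokens = len(normed[0])
    # Aggregate across layers (mean is an assumption; the caption says "aggregated").
    agg = [sum(layer[t] for layer in normed) / len(normed)
           for t in range(n_tokens)]
    # Standardise to z-scores, then threshold at sigma_th.
    mu = statistics.fmean(agg)
    sd = statistics.pstdev(agg)
    z = [(a - mu) / sd if sd > 0 else 0.0 for a in agg]
    return [1 if v > sigma_th else 0 for v in z]


def mask_attention_output(attn_out, mask):
    """Zero the attention-output rows of tokens flagged as uncertain (assumed form)."""
    return [[0.0] * len(row) if m else list(row)
            for row, m in zip(attn_out, mask)]


# Two layers, four tokens; the last token deviates strongly in both layers.
layer_maps = [[0.1, 0.2, 0.1, 2.0],
              [0.0, 0.1, 0.2, 1.5]]
M = uncertainty_mask(layer_maps, sigma_th=1.0)
masked = mask_attention_output([[1, 2], [3, 4], [5, 6], [7, 8]], M)
```

In this toy input only the fourth token exceeds the threshold, so only its attention output is suppressed. The threshold value `sigma_th=1.0` is arbitrary.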

Theorems & Definitions (6)

  • Lemma 3.1: Approximate local Gaussianity under small perturbation
  • Theorem 3.2: Upper bound of differential entropy increases as hidden state deviation increases under adversarial attack
  • Lemma 3.1: Approximate local Gaussianity under small perturbation
  • proof
  • Theorem 3.2: Upper bound of differential entropy increases as hidden state deviation increases under adversarial attack
  • proof
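
As background for the title of Theorem 3.2, the standard maximum-entropy bound may help show why a larger hidden-state deviation can raise an upper bound on differential entropy. This is a textbook fact, not the paper's statement or proof. It assumes a random vector $Z \in \mathbb{R}^{d}$ with positive-definite covariance $\Sigma$, and both symbols are introduced here for illustration only.

```latex
h(Z) \;\le\; \tfrac{1}{2}\log\!\big((2\pi e)^{d}\det\Sigma\big)
     \;\le\; \tfrac{d}{2}\log\!\Big(2\pi e\,\tfrac{\operatorname{tr}\Sigma}{d}\Big),
\qquad
\operatorname{tr}\Sigma = \mathbb{E}\,\lVert Z-\mathbb{E}Z\rVert_2^{2}.
```

The first inequality holds with equality when $Z$ is Gaussian. The second follows from the AM-GM inequality applied to the eigenvalues of $\Sigma$. Since $\operatorname{tr}\Sigma$ is the expected squared deviation of $Z$ from its mean, the right-hand bound grows monotonically with that deviation.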