VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li

TL;DR

This paper proposes VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. It introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input.

Abstract

Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
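
As a rough illustration of the Image-Information Score idea described in the abstract, the sketch below treats IS as the predictive entropy without the image minus the predictive entropy with it. This form, the mean-over-tokens aggregation, the function names, and the toy distributions are all assumptions made for illustration; the paper's own definition of IS may differ in detail.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predictive_entropy(step_distributions):
    """Mean per-token entropy over a generated answer (assumed aggregation)."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def image_information_score(dists_without_image, dists_with_image):
    """Assumed form: uncertainty without the image minus uncertainty with it."""
    return predictive_entropy(dists_without_image) - predictive_entropy(dists_with_image)

# Toy two-token answer over a three-word vocabulary (hypothetical numbers).
with_image = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]
without_image = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]

score = image_information_score(without_image, with_image)
print(f"IS = {score:.3f}")  # positive when the image reduces uncertainty
```

Under this reading, a score near zero would suggest that the answer is driven mostly by language priors rather than by the image.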

Paper Structure (51 sections, 14 equations, 6 figures, 9 tables)


Figures (6)

  • Figure 1: Failure of LLM-based self-evaluation under language prior dominance. Methods include: Entropy (Ent), Verbalized Confidence (Verb), Semantic Entropy (Sem), and EigenScore (Eigen). Performance comparison on the ViLP dataset using LLaVA-1.5-7B, which contains paired factual and counterfactual images associated with the same prompt. Common self-evaluation methods often fail in counterfactual samples.
  • Figure 2: Overall VAUQ Framework. Given an input image-text pair $(\mathbf{v}, \mathbf{t})$, the LVLM generates a response $\mathbf{y}$. Based on the attention map $\mathrm{Attn}(v_i)$, we perform unsupervised core region masking by covering the top-$K\%$ image patches, resulting in a core-masked set $\mathbf{v}_{\text{masked}}$. Using this masked input, we compute the core-masked Image-Information Score $\mathrm{IS}_{\text{core}}$. Finally, predictive entropy $H(\mathbf{y}\mid \mathbf{v}, \mathbf{t})$ and $\mathrm{IS}_{\text{core}}$ are combined to produce the VAUQ score $s_{\mathrm{VAUQ}}$ for self-evaluation.
  • Figure 3: Visual attention ratios over evidence and irrelevant regions on the VisualCoT dataset.
  • Figure 4: Qualitative examples of core region masking using LLaVA-1.5-7B.
  • Figure 5: (a) Effect of the weighting parameter $\alpha$ in the VAUQ score equation; (b) effect of the proportion of masked image patches $K$ in the top-$K$ masking equation; (c) generalization performance across datasets.
  • ...and 1 more figure
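
The pipeline in the Figure 2 caption (attention-based top-$K\%$ masking, then a core-masked score combined with predictive entropy) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the zero fill value for masked patches, the form of $\mathrm{IS}_{\text{core}}$ as an entropy difference, and the sign convention of the combined score are all hypothetical.

```python
import numpy as np

def core_mask(attn, k_percent):
    """Boolean mask that is True for the top-k% most-attended patches."""
    attn = np.asarray(attn, dtype=float)
    n_mask = max(1, int(round(attn.size * k_percent / 100.0)))
    top_idx = np.argsort(attn)[::-1][:n_mask]  # indices of highest attention
    mask = np.zeros(attn.size, dtype=bool)
    mask[top_idx] = True
    return mask

def apply_mask(patches, mask, fill_value=0.0):
    """Return a copy of the patch features with masked patches overwritten."""
    masked = np.array(patches, dtype=float, copy=True)
    masked[mask] = fill_value  # assumed fill; the paper may cover patches differently
    return masked

def vauq_score(entropy_full, entropy_masked, alpha):
    """Assumed combination: entropy minus alpha times the core-masked IS."""
    is_core = entropy_masked - entropy_full  # assumed form of IS_core
    return entropy_full - alpha * is_core

# Hypothetical attention over 8 image patches, masking the top 25%.
attn = [0.02, 0.30, 0.05, 0.25, 0.03, 0.10, 0.15, 0.10]
mask = core_mask(attn, k_percent=25)
patches = np.ones((8, 4))               # toy patch features
masked_patches = apply_mask(patches, mask)

# Entropies would come from the LVLM; these values are placeholders.
s = vauq_score(entropy_full=0.5, entropy_masked=1.2, alpha=1.0)
```

In this sketch the score is treated as an uncertainty measure, so a large entropy increase after masking the core region lowers it; whether the paper uses this orientation is not stated in the text above.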

Theorems & Definitions (2)

  • Definition 3.1: LVLM Self-Evaluator
  • Definition 4.1: Image-Information Score (IS)