The Role of Model Confidence on Bias Effects in Measured Uncertainties for Vision-Language Models

Xinyi Liu, Weiguang Wang, Hangfeng He

TL;DR

Disentangling epistemic and aleatoric uncertainty in vision-language models under prompt biases is essential for reliable decision-making. The authors introduce bias-shuffled prompt perturbations and evaluate GPT-4o and Qwen2-VL on VL_Checklist and CREPE, using AUROC and an entropy-based decomposition to quantify uncertainty. They find that bias effects intensify at lower bias-free confidence, with bias-induced underestimation of epistemic entropy (overconfidence) and weaker effects on aleatoric entropy, and that combining multiple bias mitigations yields the largest gains. The work informs bias-mitigation strategies for uncertainty quantification in multimodal models and supports approaches that explicitly separate epistemic and aleatoric sources. The entropy decomposition is $\text{Entropy} = \text{Epistemic Entropy} + P(\text{correct}) \cdot \text{Aleatoric Entropy}$.
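The decomposition quoted in the TL;DR can be illustrated with a small numerical sketch. It assumes one plausible reading that is not confirmed by this summary: epistemic entropy is the entropy of the answer distribution after all valid answers are merged into a single outcome, and aleatoric entropy is the entropy among the valid answers alone, renormalized by $P(\text{correct})$. Under that reading the formula is the grouping property of Shannon entropy. The function names and the toy distribution (borrowing the Paris/France/Tokyo example from Figure 1) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_entropy(answer_probs, valid_answers):
    """Split total entropy into epistemic and aleatoric parts.

    Assumption (not taken from the paper's text): valid answers are merged
    into one "correct" outcome to obtain the epistemic part, and the
    aleatoric part is the entropy within the valid answers.
    """
    p_correct = sum(p for a, p in answer_probs.items() if a in valid_answers)
    wrong = [p for a, p in answer_probs.items() if a not in valid_answers]

    # Epistemic: uncertainty over {correct, each wrong answer}.
    epistemic = entropy([p_correct] + wrong)

    # Aleatoric: uncertainty among valid answers, given the answer is correct.
    if p_correct > 0:
        within = [p / p_correct for a, p in answer_probs.items() if a in valid_answers]
        aleatoric = entropy(within)
    else:
        aleatoric = 0.0

    total = entropy(answer_probs.values())
    return total, epistemic, aleatoric, p_correct


# Toy distribution: two valid answers and one wrong answer.
probs = {"Paris": 0.4, "France": 0.4, "Tokyo": 0.2}
total, epistemic, aleatoric, p_correct = decompose_entropy(probs, {"Paris", "France"})
print(total, epistemic + p_correct * aleatoric)  # the two values agree up to rounding
```

In this toy case the model splitting its mass evenly between Paris and France raises the total entropy without raising the epistemic part, which is why the two sources need to be separated before entropy is used as a signal of missing knowledge.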

Abstract

With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model's lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases have greater effects on both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence is associated with greater bias-induced underestimation of epistemic uncertainty, resulting in overconfident estimates, whereas it has no significant effect on the direction of the bias effect in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.

Paper Structure

This paper contains 50 sections, 7 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Uncertainty between valid answers (e.g., France and Paris) reflects aleatoric uncertainty, while uncertainty between Paris and Tokyo reflects epistemic uncertainty due to the model’s lack of knowledge.
  • Figure 2: Systematically greater overestimation of confidence in lower-confidence instances can flatten the estimated confidence curve, undermining ranking robustness. Sometimes it even reverses the correct order.
  • Figure 3: Perturb prompts to shuffle bias factors to estimate bias-free uncertainty.
  • Figure 4: Comparison of ROC curves for the text-based bias mitigation methods and baselines on two datasets using GPT-4o. The high prevalence of identical Mutual Information estimates makes it less suitable when a high abstention rate is required. The bias mitigation approach maintains robustness across different thresholds.
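
The Figure 4 caption notes that many identical uncertainty estimates limit the usefulness of a score across abstention thresholds. A minimal sketch of why, assuming a standard rank-based AUROC in which tied pairs count as half a win (this summary does not say how the paper handles ties): the uncertainty score is treated as a detector of incorrect answers, and the number of distinct score values bounds the number of usable thresholds. All names and data below are hypothetical.

```python
def auroc(uncertainty, is_incorrect):
    """Rank-based AUROC: P(score_incorrect > score_correct) + 0.5 * P(tie).

    `uncertainty` holds one score per instance; `is_incorrect` holds 1 when
    the model's answer was wrong (the case abstention should catch), else 0.
    """
    pos = [u for u, y in zip(uncertainty, is_incorrect) if y == 1]
    neg = [u for u, y in zip(uncertainty, is_incorrect) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect instance")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tied pair cannot be ranked either way
    return wins / (len(pos) * len(neg))


def distinct_thresholds(uncertainty):
    """Number of distinct score values, i.e. distinct abstention cut-offs."""
    return len(set(uncertainty))


# Well-separated scores: every incorrect answer is more uncertain.
separated = auroc([0.9, 0.7, 0.2, 0.1], [1, 1, 0, 0])

# Heavily tied scores: only two distinct values, so only two cut-offs exist.
tied_scores = [0.5, 0.5, 0.5, 0.1]
tied = auroc(tied_scores, [1, 0, 0, 0])
print(separated, tied, distinct_thresholds(tied_scores))
```

With heavily tied scores, raising the abstention rate means abstaining on a whole block of instances at once, so intermediate operating points on the ROC curve are not reachable.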