Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA

Jian Lan, Diego Frassinelli, Barbara Plank

TL;DR

This paper investigates the gap between human responses and vision-language model predictions in Visual Question Answering (VQA) on samples where annotators disagree, which the authors quantify as human uncertainty in disagreement (HUD). It introduces HUD scoring and three human-centered metrics: Total Variation Distance (TVD), Kullback–Leibler divergence (KL), and Human Entropy Calibration Error (EntCE). It also evaluates calibration strategies, including temperature scaling with temperature $T$. The study compares BEiT3 and LXMERT on VQA 2.0 and finds that BEiT3 achieves higher VQA-Accuracy but aligns less well with human response distributions, and that calibrating toward human distributions improves alignment more than calibrating toward accuracy. It argues that VQA-Accuracy alone is an insufficient evaluation and that aligning models with human distributions under HUD is important for deploying trustworthy VQA systems.
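
To make the three metrics concrete, here is a minimal Python sketch that compares one human response distribution with one model distribution over the same answer candidates. The function names, the example numbers, the direction of the KL term (human relative to model), and the signed per-sample entropy gap used as a stand-in for EntCE are all assumptions made for illustration; this summary does not reproduce the paper's exact formulas.

```python
import math

def tvd(p, q):
    """Total Variation Distance between two distributions over the same answers."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q). eps is an assumed guard against log(0) where q has no mass."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_gap(human, model):
    """Model entropy minus human entropy for one sample (assumed EntCE building block).

    A negative value means the model is more confident than the annotators.
    """
    return entropy(model) - entropy(human)

# Hypothetical sample: three answer candidates, annotators split 6/3/1.
human = [0.6, 0.3, 0.1]
model = [0.9, 0.05, 0.05]  # overconfident prediction

print("TVD:", tvd(human, model))
print("KL(human || model):", kl_divergence(human, model))
print("Entropy gap:", entropy_gap(human, model))
```

In this hypothetical sample the model's top answer matches the majority answer, so an accuracy-style score would look fine, while all three distribution-level quantities show that the model concentrates far more probability on one answer than the annotators do.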

Abstract

Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3's ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.
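
The idea of calibrating towards human distributions can be sketched with temperature scaling: divide the logits by a temperature $T$ before normalizing, and choose $T$ so that the resulting distributions are close to the human ones. The sketch below assumes a softmax output layer, a small grid of candidate temperatures, and mean TVD as the selection criterion. The logits, human distributions, and helper names are hypothetical, and the paper's actual calibration procedure may differ.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits / T; T > 1 flattens the distribution, T < 1 sharpens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def tvd(p, q):
    """Total Variation Distance between two distributions over the same answers."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def mean_tvd(logits_batch, human_batch, T):
    """Average TVD between human distributions and temperature-scaled predictions."""
    distances = [
        tvd(h, softmax_with_temperature(z, T))
        for z, h in zip(logits_batch, human_batch)
    ]
    return sum(distances) / len(distances)

def fit_temperature(logits_batch, human_batch, candidates):
    """Pick the candidate temperature with the lowest mean TVD (simple grid search)."""
    return min(candidates, key=lambda T: mean_tvd(logits_batch, human_batch, T))

# Hypothetical held-out data: model logits and matching human distributions.
logits_batch = [[4.0, 1.0, 0.0], [3.0, 2.5, 0.0]]
human_batch = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
candidates = [0.5, 1.0, 2.0, 4.0]

best_T = fit_temperature(logits_batch, human_batch, candidates)
print("Selected temperature:", best_T)
print("Mean TVD at T=1:", mean_tvd(logits_batch, human_batch, 1.0))
print("Mean TVD at selected T:", mean_tvd(logits_batch, human_batch, best_T))
```

Replacing the selection criterion with an accuracy-oriented calibration objective would give the accuracy-calibrated variant that the abstract contrasts with human calibration; only the objective changes, not the scaling mechanism.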

Paper Structure

This paper contains 27 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An example from VQA 2.0. Each annotator's responses are shown in a different color, and the model predictions are compared with the human distribution. VQA 2.0 provides 10 annotators per sample; only four are shown here for readability. R denotes the human response and C the confidence label.
  • Figure 2: Distributions of the two validation sets by HUD score. Black lines mark the two split boundaries, yellow lines the standard deviation, and red lines the mean.
  • Figure 3: Case study. The left part (A(1)-A(3)) presents a sample from the low set, showing each response label’s VQA-Accuracy score, the human HUD distribution, and the prediction distributions of LXMERT and BEiT3. Three settings are reported: models not calibrated, calibrated towards VQA-Accuracy, and calibrated towards human distributions. The right part (B(1)-B(3)) presents a sample from the high set in the same way. Human HUD distributions, which are used to measure human-model correlation, are colored light blue. Red marks unsatisfactory model performance, light red marks models that improved slightly but remain unsatisfactory, and dark blue marks much better correlation with humans.
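
The contrast the figures draw between per-answer VQA-Accuracy and the human distribution can be illustrated with a short sketch. It assumes the human distribution is simply the normalized count of the 10 annotator responses, ignoring the confidence labels that the paper's HUD scoring may also use. It also uses the commonly cited simplified form of VQA accuracy, min(matches / 3, 1), whereas the official evaluation additionally normalizes answer strings and averages over annotator subsets. The responses below are hypothetical.

```python
from collections import Counter

def human_distribution(responses):
    """Normalized answer counts over all annotators (confidence labels ignored)."""
    counts = Counter(responses)
    total = len(responses)
    return {answer: n / total for answer, n in counts.items()}

def vqa_accuracy(prediction, responses):
    """Simplified VQA accuracy: full credit once three annotators gave the answer."""
    return min(responses.count(prediction) / 3.0, 1.0)

# Hypothetical sample with 10 annotators, as in VQA 2.0.
responses = ["yes"] * 6 + ["no"] * 3 + ["maybe"]

dist = human_distribution(responses)
for answer in ("yes", "no", "maybe"):
    print(answer, "human share:", dist[answer],
          "VQA accuracy:", vqa_accuracy(answer, responses))
```

Under these assumptions, "yes" and "no" both receive full accuracy credit even though twice as many annotators answered "yes", which is the kind of information a distribution-level comparison retains and an accuracy score discards.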