Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Jian Lan, Diego Frassinelli, Barbara Plank
TL;DR
This paper investigates the gap between human responses and vision-language model predictions in Visual Question Answering (VQA) when annotators disagree, a condition the authors call human uncertainty in disagreement (HUD). It introduces HUD scoring and three human-centered metrics: Total Variation Distance (TVD), Kullback–Leibler divergence (KL), and Human Entropy Calibration Error (EntCE). It also evaluates calibration strategies, including Temperature Scaling with temperature $T$. The study compares BEiT3 and LXMERT on VQA 2.0 and finds that BEiT3 achieves higher VQA-Accuracy but aligns less well with human response distributions, and that calibrating toward human distributions improves alignment more effectively than calibrating toward accuracy. It argues that evaluating on VQA-Accuracy alone is insufficient and that aligning models with human distributions under HUD is crucial for deploying trustworthy, human-aligned VQA systems.
Abstract
Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task and comprehensively evaluate how well state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples by their level (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA to investigate the impact of HUD. To better align models with humans, we also examine the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3's ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, which better aligns model confidence with human uncertainty. Our findings highlight that, for VQA, consistent alignment between human responses and model predictions is understudied and should become a crucial target of future studies.
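
To make the distribution-level comparison concrete, the following is a minimal sketch of how a model's answer distribution might be compared with a human response distribution for a single question. It is not the authors' implementation. Everything in it is an illustrative assumption: the function names, the toy annotator counts and logits, the KL direction (human relative to model), and the reading of the entropy-based measure as a signed entropy difference. The paper's exact definitions of the three metrics should be taken from its method section.

```python
import math


def normalize(counts):
    """Turn raw annotator answer counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def softmax(logits, temperature=1.0):
    """Softmax with temperature T; T > 1 flattens, T < 1 sharpens."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def tvd(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; the direction used here is an assumption."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(a * math.log(a) for a in p if a > 0)


def entropy_gap(human, model):
    """Signed entropy difference (model minus human) for one sample.

    Assumed per-sample form of an entropy-based calibration error;
    averaging over a dataset is left to the caller.
    """
    return entropy(model) - entropy(human)


# Toy example: 10 annotators split over three candidate answers (hypothetical).
human = normalize([7, 2, 1])
logits = [4.0, 1.0, 0.5]  # hypothetical model scores for the same answers

for T in (1.0, 2.0):
    model = softmax(logits, temperature=T)
    print(
        f"T={T}: TVD={tvd(human, model):.3f} "
        f"KL={kl(human, model):.3f} "
        f"entropy gap={entropy_gap(human, model):.3f}"
    )
```

In this sketch, a temperature chosen to minimize TVD or KL against the human distribution would correspond to calibrating toward humans, whereas a temperature tuned only against the single majority answer would correspond to accuracy-oriented calibration.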
