Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering

Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich, Thomas Seidl

TL;DR

The paper addresses the cost and calibration issues of supervised fine-tuning for visual question answering by introducing human uncertainty (HU) as a productive training signal. It formalizes HU through HaConf and HUD, proposes an HU-aware evaluation metric (HU-acc), and presents HaDola, a four-stage data selection and automatic labeling framework that starts from a small HU-annotated seed set to iteratively identify informative samples, generate pseudo-labels, and calibrate predictions toward human uncertainty. HaDola demonstrates improved accuracy and calibration on VQAv2 and VizWiz with substantially less annotated data, and ablation studies confirm the necessity of each component. The work shows that explicitly modeling HU can yield more efficient, human-aligned VLMs and suggests that selective, HU-informed data usage is more impactful than mere dataset scaling.

Abstract

Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
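
To make the contrast between a most-frequent-label target and an HU distribution concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes each annotation is an (answer, confidence) pair and that confidence maps to a numeric score through the hypothetical `CONF_SCORE` table. The per-answer score (summed confidence divided by the number of annotators) is only one plausible reading of "average score of all annotators for each answer"; the paper's exact HaConf and HUD definitions are not reproduced here. All function names are illustrative.

```python
import math
from collections import Counter

# ASSUMPTION: numeric values for self-reported confidence are illustrative only.
CONF_SCORE = {"yes": 1.0, "maybe": 0.5, "no": 0.0}


def majority_label(annotations):
    """Most frequent answer, i.e. the single target standard SFT optimizes toward."""
    counts = Counter(answer for answer, _ in annotations)
    return counts.most_common(1)[0][0]


def confidence_scores(annotations):
    """Per-answer score: summed confidence divided by the number of annotators.

    ASSUMPTION: a simple stand-in for a HaConf-style score, not the paper's formula.
    """
    n = len(annotations)
    totals = {}
    for answer, conf in annotations:
        totals[answer] = totals.get(answer, 0.0) + CONF_SCORE[conf]
    return {answer: total / n for answer, total in totals.items()}


def normalize(scores):
    """Turn non-negative scores into a probability distribution over answers."""
    z = sum(scores.values())
    if z == 0:
        raise ValueError("all scores are zero; cannot normalize")
    return {answer: s / z for answer, s in scores.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union of answers; eps smooths missing entries."""
    answers = set(p) | set(q)
    return sum(
        (p.get(a, 0.0) + eps)
        * math.log((p.get(a, 0.0) + eps) / (q.get(a, 0.0) + eps))
        for a in answers
    )


# Ten hypothetical annotators answering "What color is the car?"
annotations = (
    [("red", "yes")] * 5
    + [("red", "maybe")]
    + [("maroon", "maybe")] * 3
    + [("brown", "maybe")]
)

human = normalize(confidence_scores(annotations))
one_hot = {majority_label(annotations): 1.0}            # what majority-label training targets
soft = {"red": 0.7, "maroon": 0.2, "brown": 0.1}        # a hypothetical calibrated prediction

print(majority_label(annotations))
print(kl_divergence(human, one_hot) > kl_divergence(human, soft))
```

In this toy example the one-hot target agrees with the majority answer but sits far from the human distribution in KL terms, while the softer prediction stays close to it. This is the kind of gap that a majority-label accuracy alone would not reveal.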

Paper Structure

This paper contains 31 sections, 22 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of human uncertainty in VQA. Different annotators may provide different answers with varying confidence levels. The HaConf Score is the average score of all annotators for each answer (\ref{eq1}). The right side shows the VQA-Accuracy (\ref{vqa-acc-cal}) of different model generations. The metric fails to reflect the differences in HU, which leaves an open question: is it an accurate metric for evaluating VLMs?
  • Figure 2: Results on VQAv2. (a) Comparison of data distributions under different split designs. (b) Effects of training samples with different HU degrees, and comparison with VQA-acc. L, M, and H stand for low, medium, and high. We downsample the L and M subsets to match the sample size of H. VizWiz results show consistent findings and are given in Appendix \ref{app_viz-hu} due to page limits.
  • Figure 3: Heatmap comparison of different training methods across four backbone models on VQAv2 and VizWiz, under HU-acc and KL. For HU-acc, darker colors indicate higher accuracy, while for KL, lighter colors indicate smaller divergence. Red boxes highlight the best-performing method for each model. Our method (HaDola) consistently achieves competitive or superior performance across both datasets.
  • Figure 4: Training dynamics of HaDola. We observe an S-shaped improvement: a rapid increase with 5--10% labels, a slower gain between 10--15%, and convergence after 15%.
  • Figure 5: Radar charts of SFT performance across different VQAv2 training and validation subsets with varying HU levels. The upper row shows HU-acc (higher is better), and the lower row shows KL divergence (lower is better) for five models. The three training settings are distinguished by line style and color; Train/Val-L, -M, and -H denote training or validation on the Low, Medium, or High HU subsets.
  • ...and 5 more figures