Calibrated Uncertainty Sampling for Active Learning
Ha Manh Bui, Iliana Maifeld-Carucci, Anqi Liu
TL;DR
The paper addresses calibration gaps in uncertainty-based pool-based active learning by introducing Calibrated Uncertainty Sampling for AL (CUSAL), which first estimates per-sample calibration error on the unlabeled pool under covariate shift using a kernel Dirichlet estimator and then selects samples in a lexicographic order that prioritizes reducing calibration error before pursuing uncertainty. The authors establish a pointwise-consistency bound for the calibration estimator and derive bounds on both unlabeled-pool and unseen-data calibration errors, showing improved reliability as more labeled and unlabeled data accrue. Empirically, CUSAL consistently yields lower Expected Calibration Error and higher accuracy across MNIST, Fashion-MNIST, SVHN, CIFAR-10, CIFAR-10-LT, and ImageNet, with ablations underscoring the value of the lexicographic strategy and showing potential gains from hybrid diversity-uncertainty extensions. The work advances trustworthy active learning by enabling better uncertainty quantification without requiring hold-out recalibration, with practical impact for safety-critical deployments and scalable learning scenarios.
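The lexicographic ordering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lexicographic_query`, its arguments, and the rounding step used to create ties in the primary key are all assumptions made for illustration. It takes per-sample calibration-error estimates and uncertainty scores as given arrays, ranks by calibration error first, and breaks ties by uncertainty.

```python
import numpy as np

def lexicographic_query(calib_err, uncertainty, k, decimals=2):
    """Pick k pool indices: highest estimated calibration error first,
    ties broken by highest uncertainty (hypothetical helper)."""
    calib_err = np.asarray(calib_err, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    # Assumption: round the primary key so that near-equal calibration
    # errors count as ties and the secondary key (uncertainty) matters.
    primary = np.round(calib_err, decimals)
    # np.lexsort uses the LAST key as the primary sort key; negating
    # both keys turns the ascending sort into a descending ranking.
    order = np.lexsort((-uncertainty, -primary))
    return order[:k]

# Toy pool of four samples (made-up scores, for illustration only).
calib_err = [0.30, 0.10, 0.30, 0.05]
uncertainty = [0.20, 0.90, 0.70, 0.99]
selected = lexicographic_query(calib_err, uncertainty, k=2)
```

In this toy pool, samples 0 and 2 share the largest calibration error, so the more uncertain of the two (sample 2) is ranked first. Sample 3 is the most uncertain overall but is not selected, because uncertainty only acts as a tie-breaker.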
Abstract
We study the problem of actively learning a classifier with a low calibration error. One of the most popular Acquisition Functions (AFs) in pool-based Active Learning (AL) is querying by the model's uncertainty. However, we recognize that an uncalibrated uncertainty model on the unlabeled pool may significantly affect the AF effectiveness, leading to sub-optimal generalization and high calibration error on unseen data. Deep Neural Networks (DNNs) make this even worse, as the model uncertainty from a DNN is usually uncalibrated. Therefore, we propose a new AF that estimates calibration errors and queries the samples with the highest calibration error before leveraging DNN uncertainty. Specifically, we utilize a kernel calibration error estimator under covariate shift and formally show that AL with this AF eventually leads to a bounded calibration error on the unlabeled pool and unseen test data. Empirically, our proposed method surpasses other AF baselines by achieving lower calibration and generalization errors across pool-based AL settings.
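As a concrete reference for the calibration error discussed above, the following is a minimal sketch of the standard binned Expected Calibration Error computed on labeled data. It is not the kernel estimator under covariate shift that the paper proposes; the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted average gap between accuracy and confidence.

    probs  : (n_samples, n_classes) predicted class probabilities
    labels : (n_samples,) integer ground-truth labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)        # top-class probability
    predictions = probs.argmax(axis=1)    # predicted class
    correct = (predictions == labels).astype(float)

    # Assumption: equal-width confidence bins over (0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by fraction of samples in bin
    return float(ece)

# Toy predictions (made-up values, for illustration only).
probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
labels = [0, 1, 0, 0]
ece = expected_calibration_error(probs, labels)
```

In this toy example the model is overconfident at confidence 0.9 (accuracy 0.5) and underconfident at confidence 0.6 (accuracy 1.0), so both bins contribute a gap of 0.4.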
