JUCAL: Jointly Calibrating Aleatoric and Epistemic Uncertainty in Classification Tasks

Jakob Heiss, Sören Lambrecht, Jakob Weissteiner, Hanna Wutte, Žan Žurič, Josef Teichmann, Bin Yu

TL;DR

JUCAL addresses the imbalance between aleatoric and epistemic uncertainty in classifier ensembles by jointly calibrating both via two tunable constants, optimized using negative log-likelihood (NLL) on a calibration set. It extends temperature scaling with a diversity-adjustment parameter that modulates ensemble disagreement, yielding input-conditioned uncertainty that better reflects both data noise and model uncertainty. Empirical results across text and image domains show consistent improvements over pool-then-calibrate and uncalibrated baselines, with up to 15% lower NLL and substantially smaller predictive sets. Smaller JUCAL-calibrated ensembles can outperform larger temperature-scaled ones, which cuts inference costs. The method is simple, model-agnostic, and easy to integrate into existing pipelines, offering practical gains in reliability and efficiency for real-world classification tasks.

Abstract

We study post-calibration uncertainty for trained ensembles of classifiers. Specifically, we consider both aleatoric (label noise) and epistemic (model) uncertainty. Among the most popular and widely used calibration methods in classification are temperature scaling (i.e., pool-then-calibrate) and conformal methods. However, the main shortcoming of these calibration methods is that they do not balance the proportion of aleatoric and epistemic uncertainty. Not balancing these uncertainties can severely misrepresent predictive uncertainty, leading to overconfident predictions in some input regions while being underconfident in others. To address this shortcoming, we present a simple but powerful calibration algorithm Joint Uncertainty Calibration (JUCAL) that jointly calibrates aleatoric and epistemic uncertainty. JUCAL jointly calibrates two constants to weight and scale epistemic and aleatoric uncertainties by optimizing the negative log-likelihood (NLL) on the validation/calibration dataset. JUCAL can be applied to any trained ensemble of classifiers (e.g., transformers, CNNs, or tree-based methods), with minimal computational overhead, without requiring access to the models' internal parameters. We experimentally evaluate JUCAL on various text classification tasks, for ensembles of varying sizes and with different ensembling strategies. Our experiments show that JUCAL significantly outperforms SOTA calibration methods across all considered classification tasks, reducing NLL and predictive set size by up to 15% and 20%, respectively. Interestingly, even applying JUCAL to an ensemble of size 5 can outperform temperature-scaled ensembles of size up to 50 in terms of NLL and predictive set size, resulting in up to 10 times smaller inference costs. Thus, we propose JUCAL as a new go-to method for calibrating ensembles in classification.
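The abstract describes JUCAL as fitting two constants by minimizing NLL on a calibration set, but this section does not state the formula itself. The following is a minimal sketch under an assumed parametrization: `c1` acts as a shared temperature, and `c2` shrinks or stretches each member's logits around the ensemble-mean logits before the softmax outputs are averaged. The function names, the grid search, and the exact form of the adjustment are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def jucal_probs(logits, c1, c2):
    """Pooled class probabilities under the ASSUMED two-constant form.

    logits: array of shape (M, N, K) = (members, inputs, classes).
    c1: temperature-like constant (> 0) that scales all logits.
    c2: diversity constant; 1 keeps member disagreement unchanged,
        0 collapses every member onto the mean logits.
    """
    logits = np.asarray(logits, dtype=float)
    mean = logits.mean(axis=0, keepdims=True)
    adjusted = ((1.0 - c2) * mean + c2 * logits) / c1
    return softmax(adjusted).mean(axis=0)  # shape (N, K)

def nll(probs, y):
    # Mean negative log-likelihood of the integer labels y.
    y = np.asarray(y)
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def fit_jucal(logits_cal, y_cal, c1_grid, c2_grid):
    """Grid search for (c1, c2) minimizing NLL on the calibration set."""
    best = (np.inf, 1.0, 1.0)
    for c1 in c1_grid:
        for c2 in c2_grid:
            loss = nll(jucal_probs(logits_cal, c1, c2), y_cal)
            if loss < best[0]:
                best = (loss, c1, c2)
    return best  # (calibration NLL, c1, c2)
```

Because only the members' logits are needed, a sketch like this never touches model weights, which matches the abstract's claim that no access to internal parameters is required. A grid search is used here purely for simplicity; any optimizer over the two constants would serve the same purpose.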

Paper Structure (59 sections, 29 equations, 24 figures, 11 tables, 4 algorithms)

Figures (24)

  • Figure 1: Predictive probability estimation for a synthetic 2D binary classification task. (a) Softmax outputs from a single NN. (b) Deep Ensemble. (c) & (d) show the same ensemble as in (b) but with different calibration algorithms applied to it. In all cases, the uncertainty peaks near the decision boundary, but only JUCAL sufficiently accounts for epistemic uncertainty by widening the uncertain region (bright colors) as the distance to the training data increases. This reflects the model's limited knowledge in data-sparse regions, highlighting the ensemble's ability to distinguish between aleatoric and epistemic components.
  • Figure 2: Scatter plots of ensemble members' softmax outputs for (a) binary ($K=2$) and (b-e) ternary ($K=3$) classification. Each subplot shows a different possibility of how the $M=50$ predictions could be arranged for a fixed input point $x$. Each point represents a probability vector $p(y|x,\theta_m)$ over $K$ classes estimated by an ensemble member. (a) & (b) low total predictive uncertainty; (c) very high aleatoric and low epistemic uncertainty; (d) low aleatoric and very high epistemic uncertainty; (d) & (e) high epistemic uncertainty. Theoretically, (d) claims that the aleatoric uncertainty is certainly low, while (e) is uncertain about the aleatoric uncertainty, but in practice, both (d) & (e) should usually be simply interpreted as high epistemic uncertainty (see \ref{['rem:UniformVsCornersMoreEpsitemic']}).
  • Figure 3: Binary classification example with $X\sim\mathcal{N}(0,1)$. The ensemble logits strongly agree in the center of the distribution $x\in[-2,2]$, but disagree more as one moves away from the center. The two subplots show the same ensemble before and after applying JUCAL to it.
  • Figure 4: Text Classification Results. For each of the six subplots, lower values of the metrics (displayed on the y-axis) are better. On the x-axis, we list 12 text classification datasets (a 10%-mini and a 100%-full version of 6 distinct datasets). The striped bars correspond to ensemble size $M=5$, while the non-striped bars correspond to $M=50$. JUCAL's results are yellow. For all six metrics (defined in \ref{['sec:MetricsAndBenchmarks']}), we show the average and $\pm1$ standard deviation across 5 random validation-test splits. (a) NLL normalized by the mean of JUCAL Greedy-50 on the corresponding full dataset; (b) $\text{AORAC}=1-\text{AURAC}$; (c) $\text{AOROC}=1-\text{AUROC}$; (d) Average set size for the coverage threshold of 99.9% for DBpedia (Full and Mini) and 99% for all other datasets; (e) Brier Score; (f) $\text{Misclassification Rate}=1-\text{Accuracy}$. For more detailed results, see the corresponding tables in \ref{['appendix:sec:FurhterResutls']}.
  • Figure 5: Image Classification Results. For each of the six subplots, lower values of the metrics (displayed on the y-axis) are better. On the x-axis, we list distinct image classification datasets (and two hyperparameter-ablation studies for MNIST). JUCAL's results are yellow. For all six metrics (defined in \ref{['sec:MetricsAndBenchmarks']}), we show the average and $\pm1$ standard deviation across 10 random train-validation-test splits. (a) NLL normalized by the mean of JUCAL Greedy-5; (b) $\text{AORAC}=1-\text{AURAC}$; (c) $\text{AOROC}=1-\text{AUROC}$; (d) Average set size for the coverage threshold of 99% for CIFAR-10, 90% for CIFAR-100, and 99.9% for all variants of MNIST and Fashion-MNIST; (e) Brier Score; (f) $\text{Misclassification Rate}=1-\text{Accuracy}$.
  • ...and 19 more figures
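
The arrangements described in the Figure 2 caption can be made concrete with the standard entropy-based decomposition of ensemble uncertainty: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the members, and epistemic uncertainty is their difference. This is a common textbook decomposition and is shown here only as an illustrative sketch; the captions above do not confirm that the paper uses exactly these definitions, and the helper names are hypothetical.

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy in nats; clipping avoids log(0) for one-hot vectors.
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def decompose(member_probs):
    """Split predictive uncertainty for one input x.

    member_probs: array of shape (M, K), one softmax vector per member.
    Returns (total, aleatoric, epistemic) in nats.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    total = entropy(member_probs.mean(axis=0))   # entropy of pooled prediction
    aleatoric = entropy(member_probs).mean()     # mean entropy of the members
    epistemic = total - aleatoric                # disagreement between members
    return total, aleatoric, epistemic

# Case like Figure 2(c): every member predicts (almost) uniform probabilities.
all_uniform = np.full((50, 3), 1.0 / 3.0)

# Case like Figure 2(d): members are confident but disagree on the class.
at_corners = np.eye(3)[np.arange(48) % 3]
```

Under this decomposition, `decompose(all_uniform)` attributes essentially all uncertainty to the aleatoric part, while `decompose(at_corners)` attributes it to the epistemic part, even though the pooled prediction is uniform in both cases.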

Theorems & Definitions (9)

  • Remark A.1: Uniform over the Simplex vs. Corners of the Simplex
  • Remark A.2: Ensembles as Bayesian approximation
  • Remark A.3: Applying JUCAL to Bayesian methods
  • Remark A.4: Bayesian version of \ref{['rem:UniformVsCornersMoreEpsitemic']}
  • Remark A.5: \ref{['tab:ReducingUncertainty']} should be understood on average
  • Example A.6: Electronic component
  • Example A.7: Similar example for a more generic prior
  • Remark A.8: How do different algorithms deal with \ref{['ex:GenericPriorExtensionElectronicCOmponent']}
  • Example I.1: Classification with Unbalanced Groups