Improving Deep Ensembles by Estimating Confusion Matrices

Danil Kuzin, Olga Isupova, Steven Reece, Brooke D Simmons

TL;DR

This work addresses the problem of aggregating predictions from deep ensembles more effectively than simple averaging by learning per-model confusion patterns. It introduces Soft Dawid Skene (SDS), an EM-based method that treats each ensemble member as a classifier with its own confusion matrix and uses soft outputs via Dirichlet priors to weigh predictions. SDS learns these confusion matrices and class frequencies from unlabeled data, improving accuracy, calibration (ECE, Brier score, NLL), and out-of-distribution (OOD) detection on distributional-shift benchmarks across MNIST, CIFAR-10/100, and ImageNet. The approach performs competitively with or better than ensemble averaging (EA) without requiring ground-truth labels for aggregation, and offers an online variant for streaming scenarios, which makes it practical for deploying well-calibrated, reliable ensemble predictions.

Abstract

Ensembling in deep learning improves accuracy and calibration over single networks. The traditional aggregation approach, ensemble averaging, treats all individual networks equally by averaging their outputs. Inspired by crowdsourcing, we propose an aggregation method called soft Dawid Skene for deep ensembles that estimates confusion matrices of ensemble members and weighs them according to their inferred performance. Soft Dawid Skene aggregates soft labels, in contrast to the hard labels often used in crowdsourcing. In extensive experiments, we empirically show the superiority of soft Dawid Skene over ensemble averaging in accuracy, calibration, and out-of-distribution detection.
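
The contrast drawn in the abstract can be illustrated with a small sketch. It is not the paper's soft Dawid Skene: it uses the classic hard-label Dawid-Skene EM from crowdsourcing, applied to each member's predicted class, next to plain ensemble averaging. The function names, the additive `smoothing` constant, and the fixed iteration count are assumptions made for illustration only.

```python
import math

def ensemble_average(member_probs):
    """member_probs[m][n][k]: probability from member m for item n, class k."""
    M, N, K = len(member_probs), len(member_probs[0]), len(member_probs[0][0])
    return [[sum(member_probs[m][n][k] for m in range(M)) / M for k in range(K)]
            for n in range(N)]

def dawid_skene(votes, n_classes, n_iter=50, smoothing=1.0):
    """Classic hard-label Dawid-Skene EM (illustrative, not the paper's soft variant).

    votes[m][n] is the class predicted by member m for item n.
    Returns (posterior over true classes, per-member confusion matrices, class prior).
    """
    M, N, K = len(votes), len(votes[0]), n_classes
    # Initialise the posterior with the vote proportions for each item.
    post = [[0.0] * K for _ in range(N)]
    for n in range(N):
        for m in range(M):
            post[n][votes[m][n]] += 1.0 / M
    for _ in range(n_iter):
        # M-step: class frequencies and confusion matrices from soft counts.
        prior = [(sum(post[n][k] for n in range(N)) + smoothing) / (N + K * smoothing)
                 for k in range(K)]
        conf = []
        for m in range(M):
            counts = [[smoothing] * K for _ in range(K)]
            for n in range(N):
                for k in range(K):
                    counts[k][votes[m][n]] += post[n][k]
            # Row k holds P(member predicts j | true class k).
            conf.append([[c / sum(row) for c in row] for row in counts])
        # E-step: posterior over the true class of each item, in log space.
        for n in range(N):
            logp = [math.log(prior[k])
                    + sum(math.log(conf[m][k][votes[m][n]]) for m in range(M))
                    for k in range(K)]
            mx = max(logp)
            w = [math.exp(v - mx) for v in logp]
            s = sum(w)
            post[n] = [v / s for v in w]
    return post, conf, prior
```

No ground-truth labels enter the procedure: the confusion matrices are estimated only from agreement patterns among the members. A member that systematically maps one class to another ends up with a large off-diagonal entry, so its votes are reinterpreted rather than counted equally, which is the behaviour ensemble averaging cannot express.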

Paper Structure

This paper contains 36 sections, 8 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: True confusion matrices for two ensemble members on the MNIST distributional shift experiment with rotation of $60^{\circ}$.
  • Figure 2: Results on MNIST with rotations of increasing angle: (a) accuracy, (b) ECE, (c) Brier score, (d) NLL. Several methods perform similarly, so some lines overlap and are not visible: all methods have similar accuracy, and the DS and BCC methods have similar ECE and Brier score.
  • Figure 3: Results on CIFAR10 with Frosted Glass Blur corruption of increasing severity: (a) accuracy, (b) ECE, (c) Brier score, (d) NLL.
  • Figure 4: Results on CIFAR100 with Brightness corruption of increasing severity: (a) accuracy, (b) ECE, (c) Brier score, (d) NLL.
  • Figure 5: ECE ($\downarrow$) results on ImageNet with Zoom Blur corruption of increasing severity. The other 3 metrics have similar results for both methods.
  • ...and 5 more figures
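
The figures above report ECE, Brier score, and NLL alongside accuracy. As a reminder of what these calibration metrics measure, here is a minimal sketch using their standard textbook definitions. The function names, the equal-width binning with `n_bins=10`, and the `eps` clipping are assumptions; the paper's exact evaluation settings are not stated in this summary.

```python
import math

def brier_score(probs, labels):
    """Mean squared distance between predicted probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)

def nll(probs, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(probs)

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    # Each bin stores [count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p)
        predicted = p.index(confidence)
        b = min(int(confidence * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += confidence
        bins[b][2] += float(predicted == y)
    n = len(probs)
    # Sum of (bin size / n) * |accuracy - mean confidence| over non-empty bins.
    return sum(abs(s_conf - s_corr) / n for count, s_conf, s_corr in bins if count)
```

All three are lower-is-better, matching the $\downarrow$ marker in the Figure 5 caption. ECE compares confidence with accuracy only for the predicted class, while Brier score and NLL penalise the full predictive distribution.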