Improving Deep Ensembles by Estimating Confusion Matrices
Danil Kuzin, Olga Isupova, Steven Reece, Brooke D Simmons
TL;DR
This work addresses the problem of aggregating predictions from deep ensembles more effectively than simple averaging by learning per-model confusion patterns. It introduces Soft Dawid Skene (SDS), an EM-based method that treats each ensemble member as a classifier with its own confusion matrix and uses soft outputs via Dirichlet priors to weight predictions. SDS learns these confusion matrices and class frequencies from unlabeled data, improving accuracy, calibration (expected calibration error, Brier score, negative log-likelihood), and out-of-distribution (OOD) detection on distributional-shift benchmarks across MNIST, CIFAR-10/100, and ImageNet. The approach performs competitively with or better than ensemble averaging without requiring ground-truth labels for aggregation. It also offers an online variant for streaming scenarios, which makes it practical for deploying well-calibrated, reliable ensemble predictions.
Abstract
Ensembling in deep learning improves accuracy and calibration over single networks. The traditional aggregation approach, ensemble averaging, treats all individual networks equally by averaging their outputs. Inspired by crowdsourcing, we propose an aggregation method for deep ensembles called soft Dawid Skene, which estimates the confusion matrices of ensemble members and weights the members according to their inferred performance. Unlike crowdsourcing methods, which typically aggregate hard labels, soft Dawid Skene aggregates soft labels. In extensive experiments, we empirically show that soft Dawid Skene outperforms ensemble averaging in accuracy, calibration, and out-of-distribution detection.
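To make the contrast between the two aggregation styles concrete, the following is a minimal NumPy sketch of ensemble averaging next to the classical hard-label Dawid Skene EM procedure from crowdsourcing. It is not the soft variant proposed in this work: it aggregates argmax votes rather than soft outputs and uses plain additive smoothing in place of any prior. The function names, the `smoothing` parameter, and the majority-vote initialization are assumptions made for illustration.

```python
import numpy as np

def ensemble_average(probs):
    """Baseline: average member outputs with equal weight.

    probs: array of shape (M, N, K) holding the class probabilities
    of M ensemble members on N inputs with K classes.
    """
    return probs.mean(axis=0)

def dawid_skene_hard(votes, n_classes, n_iter=50, smoothing=1e-2):
    """Classical hard-label Dawid Skene EM (illustrative sketch only).

    votes: integer array of shape (M, N); votes[m, i] is the class
    predicted by member m for input i.
    Returns the label posterior (N, K), per-member confusion matrices
    (M, K, K), and the estimated class frequencies (K,).
    """
    K = n_classes
    onehot = np.eye(K)[votes]              # (M, N, K) one-hot votes
    post = onehot.mean(axis=0)             # initialize with vote shares
    for _ in range(n_iter):
        # M-step: class frequencies and confusion matrices, where
        # conf[m, j, l] estimates P(member m predicts l | true class j).
        prior = post.mean(axis=0)
        conf = np.einsum("nj,mnl->mjl", post, onehot) + smoothing
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over the unknown true label of each input.
        log_post = np.log(prior + 1e-12) + np.einsum(
            "mnl,mjl->nj", onehot, np.log(conf)
        )
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, conf, prior
```

No ground-truth labels enter the procedure: the confusion matrices are estimated only from how the members agree and disagree with each other, so a member whose votes are close to random ends up with a near-uniform confusion matrix and contributes little to the posterior.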
