Credal Ensemble Distillation for Uncertainty Quantification
Kaizheng Wang, Fabio Cuzzolin, David Moens, Hans Hallez
TL;DR
This work tackles the high inference cost of deep ensembles (DEs) for predictive uncertainty by introducing credal ensemble distillation (CED), which compresses a DE of $M$ SNNs into a single model, CREDIT, whose outputs are class-wise probability intervals that define a credal set. CREDIT predicts an intersection probability $p_S^{*}$ together with the interval lengths and a weight $\beta_S$, which supports both class prediction and uncertainty quantification via the upper and lower entropies over the credal set. Training uses a distillation loss designed to preserve the ensemble’s predictive performance while transferring its credal information. Empirically, CED achieves superior or competitive uncertainty estimation (especially for epistemic uncertainty) at much lower inference overhead than running the full DE, across multiple datasets and backbones. The approach is a scalable alternative for uncertainty quantification in neural classifiers, with practical relevance to OOD detection and reliability.
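For orientation, the intersection probability can be written out as follows. This is the standard definition from the imprecise-probability literature for class-wise intervals $[l_k, u_k]$ with $\sum_k l_k \le 1 \le \sum_k u_k$ and at least one non-degenerate interval. The symbols $l_k$, $u_k$, and $\Delta_k$ are introduced here for illustration, the subscript $S$ is dropped for readability, and whether CREDIT uses exactly this parameterization is an assumption rather than something stated above.

$$\beta = \frac{1 - \sum_k l_k}{\sum_k (u_k - l_k)}, \qquad p_k^{*} = l_k + \beta\,(u_k - l_k), \qquad \sum_k p_k^{*} = 1.$$

Writing $\Delta_k = u_k - l_k$ for the interval length, the bounds follow from the three predicted quantities as $l_k = p_k^{*} - \beta\,\Delta_k$ and $u_k = p_k^{*} + (1 - \beta)\,\Delta_k$, which suggests why predicting $p^{*}$, the lengths, and the weight would be enough to specify the credal set.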
Abstract
Deep ensembles (DE) have emerged as a powerful approach for quantifying predictive uncertainty and distinguishing its aleatoric and epistemic components, thereby enhancing model robustness and reliability. However, their high computational and memory costs during inference pose significant challenges for wide practical deployment. To overcome this issue, we propose credal ensemble distillation (CED), a novel framework that compresses a DE into a single model, CREDIT, for classification tasks. Instead of a single softmax probability distribution, CREDIT predicts class-wise probability intervals that define a credal set, a convex set of probability distributions, for uncertainty quantification. Empirical results on out-of-distribution detection benchmarks demonstrate that CED achieves superior or comparable uncertainty estimation compared to several existing baselines, while substantially reducing inference overhead compared to DE.
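As a minimal sketch of how class-wise probability intervals define a credal set, the Python snippet below builds intervals from the softmax outputs of a toy ensemble by taking the per-class minimum and maximum across members. That construction is an assumption made for illustration, not necessarily the one used in CED, and the function names `probability_intervals` and `in_credal_set` are hypothetical.

```python
import numpy as np


def probability_intervals(member_probs):
    """Per-class lower/upper bounds from ensemble outputs of shape (M, C).

    Assumption: bounds are the min and max over the M members.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    lower = member_probs.min(axis=0)
    upper = member_probs.max(axis=0)
    return lower, upper


def in_credal_set(p, lower, upper, tol=1e-9):
    """Check that p sums to one and lies inside every class interval."""
    p = np.asarray(p, dtype=float)
    sums_to_one = abs(p.sum() - 1.0) <= tol
    inside_box = np.all(p >= lower - tol) and np.all(p <= upper + tol)
    return bool(sums_to_one and inside_box)


# Toy ensemble: M = 3 members, C = 3 classes (made-up numbers).
member_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
])

lower, upper = probability_intervals(member_probs)
mean_pred = member_probs.mean(axis=0)

# Interval widths give a crude per-class view of ensemble disagreement.
widths = upper - lower

print("lower:", lower)
print("upper:", upper)
print("mean prediction in credal set:", in_credal_set(mean_pred, lower, upper))
```

Under this construction every member's prediction, and hence their average, lies in the credal set, while a distribution such as `[0.9, 0.05, 0.05]` falls outside it because its first entry exceeds the upper bound for that class.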
