On Joint Regularization and Calibration in Deep Ensembles

Laurits Fredsgaard, Mikkel N. Schmidt

TL;DR

This work asks how deep ensembles can be tuned more effectively when the performance of the ensemble, rather than that of its individual members, is treated as the tuning objective. It introduces an ensemble-optimality framework and an overlapping holdout validation strategy to enable joint evaluation of weight decay, temperature scaling, and early stopping. Across image, graph, tabular, and text tasks, joint tuning often improves calibration and accuracy, though effects vary by task and metric; the overlapping holdout provides a practical compromise between data efficiency and joint evaluation. The results offer actionable guidance for practitioners on when and how to perform ensemble-aware optimization, and highlight initialization and validation choices as critical factors for robust, scalable deep ensembles.

Abstract

Deep ensembles are a powerful tool in machine learning, improving both model performance and uncertainty calibration. While ensembles are typically formed by training and tuning models individually, evidence suggests that jointly tuning the ensemble can lead to better performance. This paper investigates the impact of jointly tuning weight decay, temperature scaling, and early stopping on both predictive performance and uncertainty quantification. Additionally, we propose a partially overlapping holdout strategy as a practical compromise between enabling joint evaluation and maximizing the use of data for training. Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We highlight the trade-offs between individual and joint optimization in deep ensemble training, with the overlapping holdout strategy offering an attractive practical solution. We believe our findings provide valuable insights and guidance for practitioners looking to optimize deep ensemble models. Code is available at: https://github.com/lauritsf/ensemble-optimality-gap
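To make the distinction between individual and joint tuning concrete, the following is a minimal sketch of temperature scaling for an ensemble, written with NumPy. It is not the authors' implementation (see the linked repository for that). The function names, the grid search, the choice to average softmax probabilities across members, and the synthetic logits are all assumptions made for illustration. Individual tuning fits one temperature per member against that member's own NLL; joint tuning fits a single shared temperature against the NLL of the averaged ensemble prediction.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_nll(logits, labels, temperature):
    """NLL of the probability-averaged ensemble.

    logits: (M, N, C) array, labels: (N,) int array,
    temperature: scalar (shared) or (M,) array (one per member).
    """
    t = np.broadcast_to(np.asarray(temperature, dtype=float), (logits.shape[0],))
    probs = softmax(logits / t[:, None, None]).mean(axis=0)  # (N, C)
    picked = probs[np.arange(labels.shape[0]), labels]
    return float(-np.log(np.clip(picked, 1e-12, None)).mean())

def fit_temperature(nll_fn, grid):
    # Hypothetical helper: pick the grid value with the lowest validation NLL.
    losses = [nll_fn(t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def individual_temperatures(logits, labels, grid):
    # Each member is calibrated in isolation, ignoring the other members.
    return np.array([
        fit_temperature(lambda t, m=m: ensemble_nll(logits[m:m + 1], labels, t), grid)
        for m in range(logits.shape[0])
    ])

def joint_temperature(logits, labels, grid):
    # One shared temperature, chosen to minimize the ensemble's own NLL.
    return fit_temperature(lambda t: ensemble_nll(logits, labels, t), grid)

# Synthetic validation logits (illustrative only, not data from the paper).
rng = np.random.default_rng(0)
M, N, C = 4, 200, 3
labels = rng.integers(0, C, size=N)
logits = rng.normal(size=(M, N, C))
logits[:, np.arange(N), labels] += 2.0  # make the true class more likely
logits *= 3.0                           # exaggerate confidence

grid = np.linspace(0.25, 4.0, 16)
t_ind = individual_temperatures(logits, labels, grid)
t_joint = joint_temperature(logits, labels, grid)

print("per-member temperatures:", t_ind)
print("joint temperature:", t_joint)
print("ensemble NLL, individual:", ensemble_nll(logits, labels, t_ind))
print("ensemble NLL, joint:", ensemble_nll(logits, labels, t_joint))
```

Note that joint evaluation requires validation examples that no contributing member was trained on, which is the constraint the holdout strategies in the paper are designed around.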

Paper Structure

This paper contains 61 sections, 11 equations, 8 figures, and 11 tables.

Figures (8)

  • Figure 1: Validation performance across varying weight decay values for a WRN-16-4 on CIFAR-10, a GCN on NCI1, an MLP on Covertype, and a BiLSTM on AG News. The plots show results for ensemble sizes 1 to 4 (WRN and GCN) and 1 to 8 (MLP and BiLSTM). The optimal weight decay for each ensemble size is selected based on the lowest average NLL. (WRN: Wide ResNet; GCN: Graph Convolutional Network; MLP: Multi-Layer Perceptron; BiLSTM: Bidirectional Long Short-Term Memory; NLL: negative log-likelihood; ECE: expected calibration error).
  • Figure 2: Temperature scaling test results for the full ensemble ($M=4$ for WRN and GCN; $M=8$ for MLP and BiLSTM). This plot compares different scaling approaches across varying validation percentages and holdout strategies. (WRN: Wide ResNet; GCN: Graph Convolutional Network; MLP: Multi-Layer Perceptron; BiLSTM: Bidirectional Long Short-Term Memory; NLL: negative log-likelihood; ECE: expected calibration error).
  • Figure 3: Early stopping test performance for the full ensemble ($M=4$ for WRN and GCN; $M=8$ for MLP and BiLSTM), comparing different early stopping strategies across all holdout types. (WRN: Wide ResNet; GCN: Graph Convolutional Network; MLP: Multi-Layer Perceptron; BiLSTM: Bidirectional Long Short-Term Memory; NLL: negative log-likelihood; ECE: expected calibration error).
  • Figure 4: Additional insights into the early stopping strategies for the full ensemble. We show the stopping epoch (normalized by training steps), alongside the resulting test set ensemble diversity and predictive entropy. (WRN: Wide ResNet; GCN: Graph Convolutional Network; MLP: Multi-Layer Perceptron; BiLSTM: Bidirectional Long Short-Term Memory).
  • Figure 5: Average individual model performance within BatchEnsemble ($M=4$, WRN-16-4 on CIFAR-10), comparing different initialization strategies for fast weights across shared, overlapping, and disjoint holdout strategies. The results for classification error, NLL, and ECE are shown for the test, validation, and training sets. Notably, for Gaussian initialization with overlapping and disjoint holdouts, the close alignment of validation and training performance (as opposed to test performance) suggests potential data leakage between ensemble members. (WRN: Wide ResNet; NLL: negative log-likelihood; ECE: expected calibration error).
  • ...and 3 more figures
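
The captions above report negative log-likelihood (NLL) and expected calibration error (ECE). As a reference point, here is a minimal sketch of how these two metrics are commonly computed from predicted class probabilities. The equal-width binning and the default of 15 bins are assumptions; this section does not state which ECE variant or bin count the paper uses.

```python
import numpy as np

def nll(probs, labels):
    # Mean negative log-probability assigned to the true class.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(np.clip(picked, 1e-12, None)).mean())

def ece(probs, labels, n_bins=15):
    # Expected calibration error with equal-width confidence bins (assumed variant).
    conf = probs.max(axis=1)                    # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; digitize against the interior edges.
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)
```

For an ensemble, `probs` would be the member-averaged class probabilities, so both metrics are evaluated on the combined prediction rather than on any single member.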