Checkpoint Ensembles: Ensemble Methods from a Single Training Process

Hugh Chen, Scott Lundberg, Su-In Lee

TL;DR

Checkpoint ensembles (CE) offer a practical way to obtain ensemble benefits from a single training run: save the top-performing checkpoints according to validation score and average their predictions. Across text, image, and time-series tasks, the approach is compared against minimum validation (MV), random initialization ensembles (RIE), and weight-smoothing variants; it needs fewer training epochs than a traditional ensemble and often permits higher learning rates with faster convergence. Experiments on Reuters, CIFAR-10, and operating-room data show that CE improves significantly over minimum validation and captures a portion of the gains of random initialization ensembles, while the behavior of the smoothing methods depends on the dataset. Overall, CE is a simple, efficient, and broadly applicable strategy for robust predictive performance in neural networks and other iterative learners.

Abstract

We present checkpoint ensembles, a method that learns ensemble models in a single training process. Although checkpoint ensembles can be applied to any parametric iterative learning technique, here we focus on neural networks. Neural networks' simple, composable neurons make it possible to capture many individual and interaction effects among features. However, small sample sizes and sampling noise may produce patterns in the training data that are not representative of the true relationship between the features and the outcome. Regularization during training (e.g. dropout) is often used as a solution, but it is no panacea: it does not perfectly address overfitting. Even with methods like dropout, two further methodologies are commonly used in practice. The first is to use a validation set independent of the training set to decide when to stop training. The second is to use ensemble methods to further reduce overfitting and take advantage of local optima (i.e. averaging over the predictions of several models). In this paper, we explore checkpoint ensembles -- a simple technique that combines these two ideas in one training process. Checkpoint ensembles improve performance by averaging the predictions from "checkpoints" of the best models within a single training process. On three real-world data sets (text, image, and electronic health record data) and with three prediction models (a vanilla neural network, a convolutional neural network, and a long short-term memory network), we show that checkpoint ensembles outperform existing methods: a method that selects a model by minimum validation score, and two methods that average models by weights. Our results also show that checkpoint ensembles capture a portion of the performance gains that traditional ensembles provide.
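
The core idea in the abstract (keep the best checkpoints of one training run and average their predictions) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that each epoch has already produced a validation loss (lower is better) and a list of predicted class probabilities. The names `top_k_checkpoints`, `checkpoint_ensemble`, and `minimum_validation`, and the toy numbers in `history`, are hypothetical.

```python
from typing import List, Tuple

# One entry per epoch: (validation loss, predicted class probabilities per example).
# Assumption: the validation score is a loss, so lower is better.
Epoch = Tuple[float, List[List[float]]]


def top_k_checkpoints(history: List[Epoch], k: int) -> List[Epoch]:
    """Keep the k epochs with the lowest validation loss."""
    return sorted(history, key=lambda epoch: epoch[0])[:k]


def average_predictions(checkpoints: List[Epoch]) -> List[List[float]]:
    """Element-wise mean of the checkpoints' predicted probabilities."""
    n = len(checkpoints)
    predictions = [preds for _, preds in checkpoints]
    return [
        [sum(values) / n for values in zip(*rows)]  # mean per class
        for rows in zip(*predictions)               # one group of rows per example
    ]


def checkpoint_ensemble(history: List[Epoch], k: int) -> List[List[float]]:
    """Average the predictions of the k best checkpoints of one training run."""
    return average_predictions(top_k_checkpoints(history, k))


def minimum_validation(history: List[Epoch]) -> List[List[float]]:
    """Baseline: use only the single checkpoint with the lowest validation loss."""
    return top_k_checkpoints(history, 1)[0][1]


# Toy training history with made-up numbers: 4 epochs, 2 examples, 2 classes.
history = [
    (0.70, [[0.6, 0.4], [0.3, 0.7]]),
    (0.45, [[0.8, 0.2], [0.4, 0.6]]),
    (0.50, [[0.6, 0.4], [0.2, 0.8]]),
    (0.90, [[0.5, 0.5], [0.5, 0.5]]),
]

ce_predictions = checkpoint_ensemble(history, k=2)
mv_predictions = minimum_validation(history)
```

In a real training loop the predictions would come from model snapshots saved at each epoch. The sketch stores predictions directly only to stay self-contained.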

Paper Structure

This paper contains 16 sections, 4 figures, 4 tables, and 5 algorithms.

Figures (4)

  • Figure 1: The rounded boxes going from left to right represent models at each step of a particular training process (e.g. using gradient descent). The shading represents validation score -- lighter shades represent a better score. For either ensemble, we average the predictions from the best models to get the final prediction $P$.
  • Figure 2: Pictorial representation of scenarios for gradient descent: (A) when there is one optimal point and (B) when there are two local optima. The shading represents the optimum in terms of our loss function, plotted against our (two) parameters. The whiter the shade, the closer to optimal. The arrows represent gradient descent.
  • Figure 3: Accuracy on the test set and epochs to convergence (i.e. number of sequential epochs to the maximum validation accuracy) for different learning rates. We fit a spline to the accuracy and draw vertical lines through the maximum point on each of the splines.
  • Figure 4: Accuracy on the test set and epochs to convergence (i.e. number of sequential epochs to the maximum validation accuracy) for different learning rates. We fit a spline to the accuracy and draw vertical lines through the maximum point on each of the splines.
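
The "epochs to convergence" quantity in the captions of Figures 3 and 4 is described as the number of sequential epochs to the maximum validation accuracy. A minimal sketch of that definition follows. It assumes 1-based epoch counting and that ties resolve to the earliest epoch, neither of which the captions specify. The function name `epochs_to_convergence`, the learning rates, and the accuracy values are made up for illustration, and the spline fit mentioned in the captions is omitted.

```python
from typing import Dict, List


def epochs_to_convergence(val_accuracy: List[float]) -> int:
    """Number of epochs (1-based) until the maximum validation accuracy is first reached."""
    best = max(val_accuracy)
    return val_accuracy.index(best) + 1


# Hypothetical validation-accuracy curves for two learning rates (toy values).
runs: Dict[float, List[float]] = {
    0.01: [0.61, 0.70, 0.74, 0.77, 0.78, 0.78],
    0.10: [0.70, 0.79, 0.77, 0.76, 0.75, 0.74],
}

convergence = {lr: epochs_to_convergence(acc) for lr, acc in runs.items()}
```

With these toy curves the higher learning rate reaches its best validation accuracy in fewer epochs, which is the kind of trade-off the figures plot against test accuracy.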