Checkpoint Ensembles: Ensemble Methods from a Single Training Process

Hugh Chen; Scott Lundberg; Su-In Lee

TL;DR

Checkpoint ensembles (CE) obtain ensemble benefits from a single training run by saving the top-performing checkpoints, ranked by validation score, and averaging their predictions. Evaluated against minimum validation (MV), random initialization ensembles (RIE), and weight-smoothing variants across text, image, and time-series tasks, the approach reduces the number of training epochs and often permits higher learning rates with faster convergence. Experiments on Reuters, CIFAR-10, and operating-room data show that CE yields significant improvements over MV and captures a portion of the gains from RIE, while the behavior of the smoothing methods depends on the dataset. Overall, CE is a simple and efficient strategy for improving predictive performance in neural networks and other iterative learners.

Abstract

We present the checkpoint ensembles method, which learns an ensemble model in a single training process. Although checkpoint ensembles can be applied to any parametric iterative learning technique, here we focus on neural networks. Neural networks' composable and simple neurons make it possible to capture many individual and interaction effects among features. However, small sample sizes and sampling noise may result in patterns in the training data that are not representative of the true relationship between the features and the outcome. As a solution, regularization (e.g. dropout) is often used during training. However, regularization is no panacea -- it does not perfectly address overfitting. Even with methods like dropout, two further methodologies are commonly used in practice. The first is to use a validation set that is independent of the training set to decide when to stop training. The second is to use ensemble methods (i.e. averaging over the predictions of several models) to further reduce overfitting and take advantage of local optima. In this paper, we explore checkpoint ensembles -- a simple technique that combines these two ideas in one training process. Checkpoint ensembles improve performance by averaging the predictions from "checkpoints" of the best models within a single training process. We evaluate on three real-world data sets -- text, image, and electronic health record data -- with three prediction models: a vanilla neural network, a convolutional neural network, and a long short-term memory network. The results show that checkpoint ensembles outperform existing methods: a method that selects a model by minimum validation score, and two methods that average models by weights. Our results also show that checkpoint ensembles capture a portion of the performance gains that traditional ensembles provide.
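The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-feature logistic model, the synthetic data, the per-epoch checkpoint schedule, validation log loss as the score, and the choice k = 5 are all assumptions made for illustration. The sketch records a checkpoint after every epoch of one training run, keeps the k checkpoints with the lowest validation loss, and averages their predictions; setting k = 1 reduces it to selecting a single model by minimum validation score.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    w, b = weights
    return sigmoid(w * x + b)


def log_loss(weights, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train_with_checkpoints(train, valid, epochs=30, lr=0.5):
    """Run one training process; save (validation loss, weights) each epoch."""
    w, b = 0.0, 0.0
    checkpoints = []
    for _ in range(epochs):
        for x, y in train:
            err = predict((w, b), x) - y  # gradient of log loss w.r.t. the logit
            w -= lr * err * x
            b -= lr * err
        checkpoints.append((log_loss((w, b), valid), (w, b)))
    return checkpoints


def checkpoint_ensemble(checkpoints, k):
    """Average the predictions of the k checkpoints with lowest validation loss."""
    best = sorted(checkpoints, key=lambda c: c[0])[:k]

    def ensemble_predict(x):
        return sum(predict(wts, x) for _, wts in best) / len(best)

    return ensemble_predict


def make_data(n, rng):
    # Hypothetical synthetic data: noisy threshold on a single feature.
    data = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        y = 1 if x + rng.gauss(0.0, 1.0) > 0 else 0
        data.append((x, y))
    return data


rng = random.Random(0)
train, valid = make_data(40, rng), make_data(40, rng)
checkpoints = train_with_checkpoints(train, valid)
ce_predict = checkpoint_ensemble(checkpoints, k=5)  # checkpoint ensemble
mv_predict = checkpoint_ensemble(checkpoints, k=1)  # minimum validation
```

Because the checkpoints all come from one run, the only extra cost over ordinary early stopping is storing k sets of weights and evaluating k models at prediction time.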
