Snapshot Ensembles: Train 1, get M for free

Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, Kilian Q. Weinberger

TL;DR

Ensembling improves neural network generalization but is costly. The authors propose Snapshot Ensembling, which uses a cyclic cosine learning rate schedule to drive a single model into multiple local minima, saving a snapshot at each one and ensembling the snapshots at test time, at no extra training cost. Across CIFAR, SVHN, Tiny ImageNet, and ImageNet, using ResNet, DenseNet, and Wide-ResNet architectures, the method yields consistent accuracy gains: DenseNet Snapshot Ensembles reach about 3.4% error on CIFAR-10 and 17.4% on CIFAR-100, with competitive ImageNet results (M=2). Analyses show that the snapshots are diverse and contribute complementary predictions, which supports the approach.

Abstract

Ensembles of neural networks are known to be much more robust and accurate than individual networks. However, training multiple deep networks for model averaging is computationally expensive. In this paper, we propose a method to obtain the seemingly contradictory goal of ensembling multiple neural networks at no additional training cost. We achieve this goal by training a single neural network, converging to several local minima along its optimization path and saving the model parameters. To obtain repeated rapid convergence, we leverage recent work on cyclic learning rate schedules. The resulting technique, which we refer to as Snapshot Ensembling, is simple, yet surprisingly effective. We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields lower error rates than state-of-the-art single models at no additional training cost, and compares favorably with traditional network ensembles. On CIFAR-10 and CIFAR-100 our DenseNet Snapshot Ensembles obtain error rates of 3.4% and 17.4% respectively.
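
To make the mechanics concrete, here is a minimal sketch of a cyclic cosine learning rate schedule with one snapshot per cycle, followed by test-time averaging of the snapshots' predictions. It assumes a shifted-cosine form that restarts at a maximum rate at the start of every cycle and decays toward zero by the cycle's end. The function names, the 0-based step counter, and the use of plain Python lists for class probabilities are illustrative assumptions, not details taken from the abstract.

```python
import math


def snapshot_lr(step, total_steps, num_cycles, lr_max):
    """Learning rate at a 0-based training step.

    Assumed form: lr = lr_max / 2 * (cos(pi * t / cycle_len) + 1),
    where t is the position inside the current cycle.
    """
    cycle_len = math.ceil(total_steps / num_cycles)
    t = step % cycle_len  # restart the cosine at every cycle boundary
    return 0.5 * lr_max * (math.cos(math.pi * t / cycle_len) + 1.0)


def snapshot_steps(total_steps, num_cycles):
    """0-based steps that end each cycle, i.e. where a snapshot would be saved."""
    cycle_len = math.ceil(total_steps / num_cycles)
    return [min((k + 1) * cycle_len, total_steps) - 1 for k in range(num_cycles)]


def average_predictions(per_snapshot_probs):
    """Average the class-probability vectors of several snapshots for one input."""
    n = len(per_snapshot_probs)
    return [sum(column) / n for column in zip(*per_snapshot_probs)]
```

With `total_steps=300` and `num_cycles=6`, each cycle spans 50 steps: the rate returns to `lr_max` at steps 0, 50, 100, and so on, and snapshots fall on the last step of each cycle, where the rate is smallest. A step here can stand for an epoch or a mini-batch iteration; the sketch does not fix that choice.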

Paper Structure

This paper contains 9 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Left: Illustration of SGD optimization with a typical learning rate schedule. The model converges to a minimum at the end of training. Right: Illustration of Snapshot Ensembling. The model undergoes several learning rate annealing cycles, converging to and escaping from multiple local minima. We take a snapshot at each minimum for test-time ensembling.
  • Figure 2: Training loss of 100-layer DenseNet on CIFAR10 using standard learning rate (blue) and $M=6$ cosine annealing cycles (red). The intermediate models, denoted by the dotted lines, form an ensemble at the end of training.
  • Figure 3: DenseNet-100 Snapshot Ensemble performance on CIFAR-10 and CIFAR-100 with restart learning rate $\alpha_0=0.1$ (left two) and $\alpha_0=0.2$ (right two). Each ensemble is trained with $M=6$ annealing cycles (50 epochs each).
  • Figure 4: Snapshot Ensembles under different training budgets on (Left) CIFAR-10 and (Middle) CIFAR-100. Right: Comparison of Snapshot Ensembles with true ensembles.
  • Figure 5: Interpolations in parameter space between the final model (sixth snapshot) and all intermediate snapshots. $\lambda=0$ represents an intermediate snapshot model, while $\lambda=1$ represents the final model. Left: A Snapshot Ensemble with cosine annealing cycles ($\alpha_0=0.2$ every $B/M=50$ epochs). Right: A NoCycle Snapshot Ensemble (two learning rate drops, snapshots every $50$ epochs).
  • ...and 4 more figures
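
The parameter-space interpolation described in the Figure 5 caption can be sketched as a convex combination of two weight vectors. This is a minimal illustration under the assumption that interpolation is linear, with $\lambda=0$ giving the intermediate snapshot and $\lambda=1$ giving the final model, as the caption states. The function name and the flat-list representation of the parameters are hypothetical.

```python
def interpolate_params(theta_snapshot, theta_final, lam):
    """Return (1 - lam) * theta_snapshot + lam * theta_final, element-wise.

    Both arguments are flat lists of floats of equal length (an assumption
    made for illustration; real models store parameters as tensors).
    """
    if len(theta_snapshot) != len(theta_final):
        raise ValueError("parameter vectors must have the same length")
    return [(1.0 - lam) * s + lam * f for s, f in zip(theta_snapshot, theta_final)]
```

Evaluating a model at several values of `lam` between 0 and 1 would trace a path between the two snapshots. A rise in error along that path would suggest the snapshots sit in different minima, while a flat path would suggest they share one.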