FreezeOut: Accelerate Training by Progressively Freezing Layers

Andrew Brock, Theodore Lim, J. M. Ritchie, Nick Weston

TL;DR

FreezeOut reduces training cost by progressively freezing the early layers of a neural network and excluding them from the backward pass. Each layer follows its own cosine-annealed learning-rate schedule that decays to zero at a layer-dependent point in training, after which the layer is frozen. Experiments on CIFAR with DenseNet, WideResNet, and VGG show architecture-dependent results: up to 20% wall-clock savings with about 3% accuracy loss for DenseNets, a 20% speedup without accuracy loss for ResNets, and no improvement for VGG. The authors provide practical defaults (cubic scheduling with learning-rate scaling and t0 ≈ 0.512, where t0 is the fraction of the training run at which the first layer freezes) and public PyTorch code, positioning FreezeOut as a viable speedup for prototyping and resource-constrained training.

Abstract

The early layers of a deep neural net have the fewest parameters, but take up the most computation. In this extended abstract, we propose to only train the hidden layers for a set portion of the training run, freezing them out one-by-one and excluding them from the backward pass. Through experiments on CIFAR, we empirically demonstrate that FreezeOut yields savings of up to 20% wall-clock time during training with 3% loss in accuracy for DenseNets, a 20% speedup without loss of accuracy for ResNets, and no improvement for VGG networks. Our code is publicly available at https://github.com/ajbrock/FreezeOut
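
The per-layer schedule summarized above can be illustrated with a minimal sketch. It is not the authors' released code. It assumes that the freeze points t_i are spaced linearly from t0 (first layer) to 1.0 (last layer), that the cubic variant cubes those values (so a pre-cubing t0 of 0.8 gives 0.512), and that learning-rate scaling divides the base rate by t_i. The function names `freeze_points` and `layer_lr` are hypothetical and chosen only for this illustration.

```python
import math


def freeze_points(num_layers, t0, cubic=False):
    """Fraction of training at which each layer's learning rate reaches zero.

    Assumption: values are linearly spaced from t0 (first layer) to 1.0
    (last layer); the cubic variant cubes each of those values.
    """
    if num_layers == 1:
        points = [1.0]
    else:
        points = [t0 + (1.0 - t0) * i / (num_layers - 1)
                  for i in range(num_layers)]
    return [p ** 3 for p in points] if cubic else points


def layer_lr(t, t_i, base_lr, scale_lr=False):
    """Cosine-annealed learning rate of one layer at training fraction t.

    t and t_i are fractions of the full run in [0, 1]. Once t reaches t_i
    the layer is treated as frozen and its rate is exactly zero.
    """
    if t >= t_i:
        return 0.0
    # Assumption: scaling raises the initial rate of layers that freeze early.
    lr0 = base_lr / t_i if scale_lr else base_lr
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / t_i))


if __name__ == "__main__":
    # Five layers with t0 = 0.5, the configuration named in the Figure 1 caption.
    points = freeze_points(5, 0.5)
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        rates = [round(layer_lr(t, t_i, base_lr=0.1), 4) for t_i in points]
        print(f"t={t:.2f}  per-layer lr={rates}")
```

In a training loop, a layer whose rate has reached zero could then be dropped from the backward pass, for example by disabling gradients for its parameters; that step is omitted here to keep the sketch free of framework dependencies.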

Paper Structure

This paper contains 7 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: Per-Layer Learning Rate Schedules for a 5-hidden-layer network with $t_0=0.5$.
  • Figure 2: FreezeOut results for k=12, L=76 DenseNets on CIFAR-100 for 100 epochs. Shaded areas represent one standard deviation from the mean across 2-5 training runs.
  • Figure 3: FreezeOut results for k=12, L=76 DenseNets on CIFAR-10 for 100 epochs.
  • Figure 4: FreezeOut results for WRN40-4 on CIFAR-100. Shaded areas represent one standard deviation from the mean across 3 training runs for Cubic Scaled and 4 training runs for Linear Unscaled.
  • Figure 5: FreezeOut results for VGG-16 on CIFAR-100. Error bars represent one standard deviation from the mean across three training runs. Error bars are used here instead of shaded areas for improved clarity.