A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

TL;DR

Problem: understanding why learning rate restarts, warmup, and distillation work in deep learning. Approach: apply mode connectivity (MC) and canonical correlation analysis (CCA) to analyze loss surfaces and learned representations under these heuristics. Contributions/findings: the explanations commonly given for cosine annealing are not consistently supported; warmup keeps the deeper layers from destabilizing training; distillation transfers the teacher's latent knowledge primarily to the deeper layers; MC is robust across training choices; additional SGDR and warmup experiments across architectures. Significance: provides a nuanced view of training dynamics and suggests directions for theory and practical improvements across modern networks.

Abstract

The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit such analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz., mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons for the success of the heuristics. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed to the deeper layers.

Paper Structure

This paper contains 19 sections, 4 equations, 15 figures.

Figures (15)

  • Figure 1: Validation accuracy of models along six different curves. Curve $GA$ connects mode $G$ (found with default hyperparameters) and mode $A$ (using a large batch size); similarly, curve $GB$ connects mode $G$ and mode $B$ (using Adam), curve $GC$ connects to mode $C$ (using a linearly decaying learning rate), curve $GD$ to mode $D$ (with weaker L2 regularization), curve $GE$ to mode $E$ (using a poor initialization), and curve $GF$ to mode $F$ (without data augmentation). $t=0$ corresponds to mode $G$ in all plots.
  • Figure 2: (a) Validation accuracy of a VGG16 model trained on CIFAR-10 using SGDR with warm restarts simulated every $T_0 = 10$ epochs and $T_{mult} = 2$. (b) SGDR and SGD learning rate schemes. (c) Cross-entropy training loss on the curve found through Mode Connectivity (MC Curve) and on the line segment (Line Seg.) joining modes $w_{30}$ (model corresponding to parameters at the $30$-th epoch of SGDR) and $w_{70}$, $w_{70}$ and $w_{150}$, $w_{30}$ and $w_{150}$. (d) Cross-entropy training loss on the curve found through Mode Connectivity (MC Curve) and on the line segment (Line Seg.) joining modes $w_{55}$ (model corresponding to parameters at the $55$-th epoch of SGD with step decay learning rate scheme) and $w_{65}$, $w_{145}$ and $w_{155}$, $w_{55}$ and $w_{155}$.
  • Figure 3: (a) Training loss surface and (b) validation loss surface, log scales, for points on the plane defined by $\{w_{70},w_{150},w_{70-150}\}$ including projections of the SGDR iterates on this hyperplane.
  • Figure 4: (a) Validation accuracy and (b) learning rate for the three training setups. (c) CCA similarity for the $i$-th layer between two different iterations during training: the $0$-th (before warmup) and the $200$-th (after warmup). (d) Comparison of warmup and FC freezing strategies on VGG11 training.
  • Figure 5: CCA similarity output plots for (a) SB no warmup, (b) LB no warmup, (c, d) LB + warmup training. The $(i,j)$-th cell represents the CCA similarity between layer $i$ of the first model and layer $j$ of the other. A higher score (lighter color) implies that the layers are more similar.
  • ...and 10 more figures
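
The Figure 2 caption quotes SGDR settings of $T_0 = 10$ and $T_{mult} = 2$ and compares the iterates $w_{30}$, $w_{70}$, and $w_{150}$. The sketch below shows why those epochs are natural checkpoints: under the standard SGDR cosine schedule, cycles last 10, 20, 40, and 80 epochs, so restarts fall at epochs 10, 30, 70, and 150. It is a minimal illustration, not the authors' training code. The function names and the `lr_max` and `lr_min` values are assumptions chosen for the example; the page above does not state the learning rates used.

```python
import math


def sgdr_restart_epochs(t0, t_mult, total_epochs):
    """Return the epochs at which a cosine cycle ends and a warm restart occurs."""
    epochs = []
    cycle_len = t0
    end = t0
    while end <= total_epochs:
        epochs.append(end)
        cycle_len *= t_mult  # each cycle is t_mult times longer than the last
        end += cycle_len
    return epochs


def sgdr_lr(epoch, t0, t_mult, lr_max=0.05, lr_min=0.0):
    """Cosine-annealed learning rate at a (possibly fractional) epoch.

    lr_max and lr_min are illustrative defaults (assumptions), not values
    taken from the paper.
    """
    cycle_len = t0
    start = 0
    # Find the cycle that contains this epoch.
    while epoch >= start + cycle_len:
        start += cycle_len
        cycle_len *= t_mult
    t_cur = epoch - start  # epochs elapsed since the last restart
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    print(sgdr_restart_epochs(10, 2, 150))
    for e in (0, 5, 9.99, 10, 30, 70):
        print(e, round(sgdr_lr(e, 10, 2), 5))
```

The learning rate decays toward `lr_min` just before each restart and jumps back to `lr_max` at the restart itself, so the end-of-cycle iterates are the points where the model has been trained with the smallest step sizes.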