Stacking as Accelerated Gradient Descent

Naman Agarwal, Pranjal Awasthi, Satyen Kale, Eric Zhao

TL;DR

This work addresses why stacking initialization speeds up stagewise training, by framing stacking as a form of accelerated gradient descent in function space. It develops a unified framework showing that zero, random, and stacking initializations correspond to functional gradient descent, stochastic functional gradient descent on a smoothed loss, and Nesterov-like accelerated descent, respectively. In a deep linear residual setting, it proves that stacking with an appropriate scaling $\beta$ yields a provably accelerated convergence rate, supported by a novel potential-function analysis that tolerates update perturbations. Empirical results on synthetic data and Transformer/BERT-like models corroborate the theory, demonstrating faster convergence and practical benefits of stacking-based initializations for training deep networks and additive ensembles.

Abstract

Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of Nesterov's accelerated gradient method that allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.

Paper Structure (23 sections, 3 theorems, 58 equations, 7 figures)


Key Result

Theorem 3.1

Consider stagewise training with stacking initialization of a deep residual linear network in the setup described above with $\beta = \tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $\lambda = \tfrac{1}{L}$. Suppose that the first layer weights are initialized so that $W_1 = V_0 - \frac{1}{L}\nabla \e
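The momentum value $\beta = \tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ in the theorem is the standard choice for Nesterov's accelerated gradient method on strongly convex problems. The sketch below is not the paper's stacking procedure; it only illustrates, on a toy diagonal quadratic, how that momentum compares with plain gradient descent. Reading $\lambda = \tfrac{1}{L}$ as the step size is an assumption here, and the function names, the test problem, and the step count are all hypothetical.

```python
import math

def grad(x, diag):
    # Gradient of f(x) = 0.5 * sum_i diag[i] * x[i]^2.
    return [a * xi for a, xi in zip(diag, x)]

def loss(x, diag):
    return 0.5 * sum(a * xi * xi for a, xi in zip(diag, x))

def gradient_descent(x0, diag, steps):
    L = max(diag)  # smoothness constant of the quadratic
    x = list(x0)
    for _ in range(steps):
        g = grad(x, diag)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return x

def nesterov(x0, diag, steps):
    L, mu = max(diag), min(diag)
    kappa = L / mu  # condition number
    beta = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)
    x = list(x0)  # main iterate
    y = list(x0)  # look-ahead (extrapolated) iterate
    for _ in range(steps):
        g = grad(y, diag)
        # Gradient step taken from the look-ahead point.
        x_next = [yi - gi / L for yi, gi in zip(y, g)]
        # Momentum extrapolation using the last two main iterates.
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]
        x = x_next
    return x

# Toy problem with condition number kappa = 100 (assumed for illustration).
diag = [1.0, 10.0, 100.0]
x0 = [1.0, 1.0, 1.0]
gd_loss = loss(gradient_descent(x0, diag, 100), diag)
agd_loss = loss(nesterov(x0, diag, 100), diag)
```

On this problem the accelerated iterate reaches a much lower loss than gradient descent after the same number of steps, which is the kind of gap that widens as $\kappa$ grows.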

Figures (7)

  • Figure 1: Stacking for stagewise training language models. In each stage, a new transformer block is added, initialized with the parameters of the top block from the previous stage, and then trained for a certain number of steps.
  • Figure 2: Stacking init vs random init for stagewise training of BERT Base model. Four stages are used with 168,750 steps in each stage. Stage boundaries are marked by vertical dashed lines. Stacking init provides a clear benefit over random init.
  • Figure 3: Stacking for boosting. In each stage, a new classifier is added, initialized with the parameters of the last trained classifier from the previous stage, and then trained for a certain number of steps.
  • Figure 4: Mean squared error (MSE) vs. number of stacking stages. As the data becomes more ill-conditioned, both the stacking updates and Nesterov's updates converge faster than vanilla gradient descent.
  • Figure 5: Mean squared error (MSE) vs. number of stacking stages. The figure compares stacking updates and Nesterov's updates as $W^*$ moves farther from the identity, i.e., as $\sigma$ increases. For higher values of $\sigma$, the stacking updates diverge in the initial stages.
  • ...and 2 more figures
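
The captions for Figures 1 and 3 describe the same growth step: a new block (or classifier) is appended and initialized from the most recently trained one. A minimal sketch of that step is given below, alongside the zero and random initializations mentioned in the TL;DR. It assumes each layer is a plain list-of-lists weight matrix; the `grow` function, its arguments, and the `scale` default are hypothetical, and the $\beta$ scaling used in the theoretical analysis is not modeled here.

```python
import copy
import random

def grow(layers, init="stacking", scale=0.01, rng=None):
    """Return a new layer list with one extra layer appended.

    layers: non-empty list of weight matrices (lists of lists of floats).
    init:   "stacking" copies the top layer, "zero" uses all zeros,
            "random" draws small Gaussian entries.
    """
    rng = rng or random.Random(0)
    top = layers[-1]
    if init == "stacking":
        # Copy the parameters of the top layer from the previous stage.
        new = copy.deepcopy(top)
    elif init == "zero":
        new = [[0.0 for _ in row] for row in top]
    elif init == "random":
        new = [[rng.gauss(0.0, scale) for _ in row] for row in top]
    else:
        raise ValueError(f"unknown init: {init}")
    return layers + [new]

# One trained layer from a previous stage (values are made up).
stage1 = [[[1.0, 2.0], [3.0, 4.0]]]
stacked = grow(stage1, init="stacking")
zeroed = grow(stage1, init="zero")
```

The deep copy matters: the new layer starts with the same values as the top layer but is trained as an independent set of parameters afterwards.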

Theorems & Definitions (14)

  • Theorem 3.1
  • Lemma 3.1: Robustness of Nesterov's accelerated gradient method
  • Proof of Theorem \ref{theorem:stronglyconvexconvergence}
  • Claim 3.2
  • Proof of Claim \ref{claim:closeness-to-bound}
  • Proof of Fact \ref{fact:inverse}
  • Claim 3.4
  • Proof of Claim \ref{claim:induction}
  • Corollary 3.4: Corollary of Theorem \ref{theorem:stronglyconvexconvergence}
  • proof
  • ...and 4 more