Stacking as Accelerated Gradient Descent

Naman Agarwal; Pranjal Awasthi; Satyen Kale; Eric Zhao

TL;DR

This work explains why stacking initialization speeds up stagewise training by framing stacking as a form of accelerated gradient descent in function space. It develops a unified framework in which zero, random, and stacking initializations correspond, respectively, to functional gradient descent, stochastic functional gradient descent on a smoothed loss, and Nesterov-like accelerated descent. For deep linear residual networks, it proves that stacking with an appropriate scaling $\beta$ achieves an accelerated convergence rate, using a novel potential-function analysis that tolerates perturbations in the updates. Experiments on synthetic data and on Transformer/BERT-like models corroborate the theory, showing faster convergence with stacking-based initialization for both deep networks and additive ensembles.

Abstract

Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar, widely used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of Nesterov's accelerated gradient method that allows errors in the updates. We also conduct proof-of-concept experiments to validate our theory.
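The correspondence described above can be illustrated on the simplest additive ensemble. The following is a minimal sketch, not the paper's experimental code. It assumes each ensemble member is a plain weight vector, the loss is a fixed quadratic, each stage takes exactly one gradient step on the new member while older members stay frozen, and the new member is initialized to `beta` times the previous member. The names `stagewise`, `nesterov`, `loss`, and `grad` are hypothetical. Under these assumptions, `beta = 0` (zero initialization) reduces to gradient descent, and `beta > 0` (scaled stacking initialization) reproduces the iterates of Nesterov's method with constant momentum.

```python
import numpy as np


def loss(w, A, b):
    """Quadratic loss 0.5 * w^T A w - b^T w (A symmetric positive definite)."""
    return 0.5 * w @ A @ w - b @ w


def grad(w, A, b):
    """Gradient of the quadratic loss above."""
    return A @ w - b


def stagewise(A, b, stages, lr, beta):
    """Stagewise training of an additive ensemble F_t = w_1 + ... + w_t.

    Each stage adds one member, initialized to beta * (previous member),
    and takes a single gradient step on that member only.
    beta = 0 corresponds to zero initialization.
    """
    total = np.zeros_like(b)  # sum of all members trained so far
    prev = np.zeros_like(b)   # most recently trained member
    history = [total.copy()]
    for _ in range(stages):
        new = beta * prev                          # initialize the new member
        new = new - lr * grad(total + new, A, b)   # one step on the new member
        total = total + new                        # member is now frozen
        prev = new
        history.append(total.copy())
    return history


def nesterov(A, b, stages, lr, beta):
    """Reference: Nesterov's method with constant momentum beta."""
    x_prev = np.zeros_like(b)
    x = np.zeros_like(b)
    history = [x.copy()]
    for _ in range(stages):
        y = x + beta * (x - x_prev)                # look-ahead point
        x_prev, x = x, y - lr * grad(y, A, b)      # gradient step from y
        history.append(x.copy())
    return history


if __name__ == "__main__":
    # Ill-conditioned diagonal quadratic: kappa = L / mu = 100.
    A = np.diag(np.linspace(1.0, 100.0, 10))
    b = np.ones(10)
    L, kappa = 100.0, 100.0
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

    w_star = np.linalg.solve(A, b)
    zero_init = stagewise(A, b, 50, 1.0 / L, beta=0.0)
    stacking = stagewise(A, b, 50, 1.0 / L, beta=beta)
    reference = nesterov(A, b, 50, 1.0 / L, beta=beta)

    print("zero init gap:    ", loss(zero_init[-1], A, b) - loss(w_star, A, b))
    print("stacking init gap:", loss(stacking[-1], A, b) - loss(w_star, A, b))
    print("matches Nesterov: ", np.allclose(stacking[-1], reference[-1]))
```

The equivalence holds because the last trained member equals the difference between the two most recent ensemble sums, which is exactly the momentum term in Nesterov's update. With several gradient steps per stage or nonlinear members the match is no longer exact.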

Paper Structure (23 sections, 3 theorems, 58 equations, 7 figures)


Introduction
Related work
Stagewise training as functional gradient descent
Preliminaries.
Greedy stagewise training.
Stagewise training with zero initialization recovers functional gradient descent.
Stagewise training with random initialization recovers stochastic functional gradient descent on smoothed loss.
Stagewise training with stacking initialization recovers accelerated functional gradient descent.
Accelerated convergence of deep linear networks by stacking
Setup.
Derivation of stacking updates.
Accelerated convergence for stacking updates.
Sufficient claim.
Base case: $t=1$.
Inductive step.
...and 8 more sections

Key Result

Theorem 3.1

Consider stagewise training with stacking initialization of a deep residual linear network in the setup described above with $\beta = \tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $\lambda = \tfrac{1}{L}$. Suppose that the first layer weights are initialized so that $W_1 = V_0 - \frac{1}{L}\nabla \e
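For orientation, the constants $\beta$ and $\lambda$ above coincide with those of the textbook constant-momentum form of Nesterov's method for a $\mu$-strongly convex, $L$-smooth objective $f$ with condition number $\kappa = L/\mu$. The recursion and bound below are the standard ones, written in generic notation ($x_t$, $y_t$, $f$, and $x^*$ are assumptions and not the paper's symbols), and are given only as a reference point, not as a restatement of the theorem:

$$
y_t = x_t + \beta\,(x_t - x_{t-1}), \qquad x_{t+1} = y_t - \tfrac{1}{L}\nabla f(y_t), \qquad \beta = \tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}, \qquad x_{-1} = x_0,
$$

$$
f(x_t) - f(x^*) \le \left(1 - \tfrac{1}{\sqrt{\kappa}}\right)^{t} \left( f(x_0) - f(x^*) + \tfrac{\mu}{2}\,\lVert x_0 - x^* \rVert^2 \right).
$$

By comparison, plain gradient descent with step size $\tfrac{1}{L}$ satisfies $f(x_t) - f(x^*) \le \left(1 - \tfrac{1}{\kappa}\right)^{t} \left(f(x_0) - f(x^*)\right)$, so acceleration replaces the dependence on $\kappa$ with a dependence on $\sqrt{\kappa}$.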

Figures (7)

Figure 1: Stacking for stagewise training language models. In each stage, a new transformer block is added, initialized with the parameters of the top block from the previous stage, and then trained for a certain number of steps.
Figure 2: Stacking init vs random init for stagewise training of BERT Base model. Four stages are used with 168,750 steps in each stage. Stage boundaries are marked by vertical dashed lines. Stacking init provides a clear benefit over random init.
Figure 3: Stacking for boosting. In each stage, a new classifier is added, initialized with the parameters of the last trained classifier from the previous stage, and then trained for a certain number of steps.
Figure 4: Mean squared error (MSE) vs. number of stacking stages. As the data becomes more ill-conditioned, both the stacking updates and Nesterov's updates converge faster than vanilla gradient descent.
Figure 5: Mean squared error (MSE) vs. number of stacking stages. The figure compares stacking updates and Nesterov's updates as $W^*$ moves farther from the identity, i.e., as $\sigma$ increases. For higher values of $\sigma$, the stacking updates diverge in the initial stages.
...and 2 more figures

Theorems & Definitions (14)

Theorem 3.1
Lemma 3.1: Robustness of Nesterov's accelerated gradient method
Proof of Theorem \ref{theorem:stronglyconvexconvergence}
Claim 3.2
Proof of Claim \ref{claim:closeness-to-bound}
Proof of Fact \ref{fact:inverse}
Claim 3.4
Proof of Claim \ref{claim:induction}
Corollary 3.4: Corollary of Theorem \ref{theorem:stronglyconvexconvergence}
proof
...and 4 more
