A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Sanjeev Arora, Nadav Cohen, Noah Golowich, Wei Hu

TL;DR

This paper gives a trajectory-based convergence analysis of gradient descent for deep linear neural networks trained on whitened data. It proves convergence to the global minimum at a linear rate under two conditions on the initialization: the layers are approximately balanced, and the end-to-end weight matrix has a positive deficiency margin. The results extend prior analyses of linear residual networks to general depths and widths. Provided the hidden dimensions are at least the minimum of the input and output dimensions and the initialization meets the balancedness and margin conditions, gradient descent converges at a rate governed by the deficiency margin and the network depth. The authors also give a balanced initialization scheme that, in the scalar-output case, meets both conditions with constant probability, and they report experiments in which enforcing balance improves convergence. The work advances the understanding of optimization in deep non-convex settings and points to initialization strategies that can mitigate vanishing and exploding gradients, with possible implications for non-linear architectures.

Abstract

We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N W_{N-1} \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).
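The setting in the abstract can be made concrete with a short NumPy sketch. It assumes that, for whitened data, the $\ell_2$ loss reduces to $\frac{1}{2}\|W_{1:N}-\Phi\|_F^2$ for a target matrix $\Phi$; this form, the function names, the toy target and the learning rate are illustrative assumptions, not taken from the text above.

```python
import numpy as np


def end_to_end(weights):
    """Return W_N @ ... @ W_1 for weights = [W_1, ..., W_N]."""
    product = weights[0]
    for w in weights[1:]:
        product = w @ product
    return product


def loss(weights, phi):
    """Assumed whitened-data l2 loss: 0.5 * ||W_{1:N} - Phi||_F^2."""
    return 0.5 * np.linalg.norm(end_to_end(weights) - phi, "fro") ** 2


def gradients(weights, phi):
    """Gradient of the loss with respect to each layer, by the chain rule."""
    residual = end_to_end(weights) - phi
    d_in = weights[0].shape[1]
    grads = []
    for j, w_j in enumerate(weights):
        below = np.eye(d_in)          # product of the layers below layer j
        for w in weights[:j]:
            below = w @ below
        above = np.eye(w_j.shape[0])  # product of the layers above layer j
        for w in weights[j + 1:]:
            above = w @ above
        grads.append(above.T @ residual @ below.T)
    return grads


def gradient_descent(weights, phi, learning_rate, steps):
    """Run plain gradient descent; return final weights and the loss history."""
    weights = [w.copy() for w in weights]
    history = [loss(weights, phi)]
    for _ in range(steps):
        grads = gradients(weights, phi)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
        history.append(loss(weights, phi))
    return weights, history


# Toy run: depth 3, identity initialization, hypothetical 2x2 target.
phi = np.diag([2.0, 1.5])
init = [np.eye(2) for _ in range(3)]
final_weights, history = gradient_descent(init, phi, learning_rate=0.01, steps=2000)
print(history[0], history[-1])
```

The identity initialization used here is perfectly balanced, and its end-to-end matrix lies close enough to the toy target that the loss decays to numerical zero; it is only one convenient starting point, not the initialization studied in the paper.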

Paper Structure

This paper contains 26 sections, 20 theorems, 130 equations, 2 figures.

Key Result

Theorem 1

Assume that gradient descent is initialized such that the end-to-end matrix $W_{1:N}(0)$ has deficiency margin $c>0$ with respect to the target $\Phi$, and the weights $W_1(0),\ldots,W_N(0)$ are $\delta$-balanced with $\delta=c^2/(256\cdot{N}^3\cdot\left\|\Phi \right\|_{F}^{2(N-1)/N})$. Suppose also [...]. Then, for any $\epsilon>0$ and [...], the loss at iteration $T$ of gradient descent, $\ell(T)$, is [...]
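The two initialization quantities in Theorem 1 can be checked numerically. The sketch below assumes the balancedness measure from the Figure 1 caption (the smallest $\delta$ with $\|W_{j+1}^{\top}W_{j+1}-W_{j}W_j^\top\|_F\leq\delta$ for all $j$) and assumes that a matrix $W$ has deficiency margin $c$ with respect to $\Phi$ when $\|W-\Phi\|_F\leq\sigma_{\min}(\Phi)-c$. That second definition is not reproduced in this summary, so treat it as an assumption; the function names and the toy matrices are hypothetical.

```python
import numpy as np


def balancedness(weights):
    """Smallest delta with ||W_{j+1}^T W_{j+1} - W_j W_j^T||_F <= delta for all j."""
    gaps = [
        np.linalg.norm(weights[j + 1].T @ weights[j + 1] - weights[j] @ weights[j].T, "fro")
        for j in range(len(weights) - 1)
    ]
    return max(gaps, default=0.0)


def deficiency_margin(w_end_to_end, phi):
    """Largest c with ||W - Phi||_F <= sigma_min(Phi) - c (assumed definition).

    A non-positive value means the margin condition fails.
    """
    sigma_min = np.linalg.svd(phi, compute_uv=False).min()
    return sigma_min - np.linalg.norm(w_end_to_end - phi, "fro")


def theorem1_delta(c, depth, phi):
    """delta = c^2 / (256 * N^3 * ||Phi||_F^(2(N-1)/N)), as stated in Theorem 1."""
    phi_norm = np.linalg.norm(phi, "fro")
    return c ** 2 / (256 * depth ** 3 * phi_norm ** (2 * (depth - 1) / depth))


# Toy check: depth 3, identity layers, hypothetical 2x2 target.
phi = np.diag([2.0, 1.5])
weights = [np.eye(2) for _ in range(3)]
c = deficiency_margin(np.eye(2), phi)      # end-to-end matrix is the identity
delta_allowed = theorem1_delta(c, len(weights), phi)
print(balancedness(weights) <= delta_allowed, c > 0)
```

In this toy case the layers are exactly balanced, so the measured $\delta$ is zero and the balancedness requirement holds for any positive margin.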

Figures (2)

  • Figure 1: Experimental results. (a) Convergence of gradient descent training deep linear neural networks (depths $3$ and $8$) under customary initialization of layer-wise independent Gaussian perturbations with mean $0$ and standard deviation $s$. For each network, the number of iterations required to come within $\epsilon=10^{-5}$ of the optimal training loss is plotted as a function of $s$ (missing values indicate no convergence within $10^6$ iterations). The dataset in this experiment is a numeric regression task from the UCI Machine Learning Repository (details in text). Notice that fast convergence is attained only in a narrow band of values for $s$, and that this phenomenon is more extreme with the deeper network. (b) Same setup as in (a), but with layer-wise independent initialization replaced by balanced initialization (the paper's balanced initialization procedure) based on Gaussian perturbations with mean $0$ and standard deviation $s$. Notice that this change leads to fast convergence, for both networks, under a wide range of values for $s$. Notice also that the shallower network converges slightly faster, in line with the results of Saxe et al. (2014) and Arora et al. (2018) for $\ell_2$ loss. (c) For the run in (a) of a depth-$3$ network and standard deviation $s={10^{-3}}$, this plot shows the degree of balancedness (minimal $\delta$ satisfying $\|W_{j+1}^{\top}W_{j+1}-W_{j}W_j^\top\|_F\leq\delta~,\,\forall{j}\in\{1,\ldots,N-1\}$) against the magnitude of the weights ($\min_{j=1,\ldots,N}\|W_{j}W_j^\top\|_F$) throughout optimization. Notice that approximate balancedness persists under gradient descent, in line with the theoretical analysis. (d) Convergence of stochastic gradient descent training the fully-connected non-linear (ReLU) neural network of the MNIST tutorial built into TensorFlow (details in text). Customary layer-wise independent and balanced initializations (both based on Gaussian perturbations centered at zero) are evaluated, with varying standard deviations. For each configuration, $10$ epochs of optimization are run, followed by measurement of the training loss. Notice that although the theoretical analysis does not cover non-linear activation, softmax-cross-entropy loss and stochastic optimization, the conclusion that balanced initialization leads to improved convergence carries over to this setting.
  • Figure 2: Illustration for the proof of the lemma labeled lemma:haar_def. The dashed region denotes $\mathcal{D}$. Not to scale.
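Panel (b) of Figure 1 relies on a balanced initialization. The paper's exact procedure is not reproduced in this summary, so the following is only a sketch of one SVD-based construction that yields exactly balanced layers: draw a Gaussian matrix $A$ and split its singular values evenly across the $N$ layers. The function names are hypothetical, and every hidden width is assumed to equal the minimum of the input and output dimensions (wider hidden layers would need zero padding, omitted here).

```python
import numpy as np


def balanced_factorization(a, depth):
    """Split a (d_out x d_in) into `depth` balanced factors [W_1, ..., W_N].

    Uses the thin SVD a = U diag(s) V^T and assigns diag(s)^(1/N) to each layer,
    so that W_{j+1}^T W_{j+1} = W_j W_j^T and W_N ... W_1 = a.
    """
    if depth == 1:
        return [a.copy()]
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    root = np.diag(s ** (1.0 / depth))
    weights = [root @ vt]                                # W_1: r x d_in
    weights += [root.copy() for _ in range(depth - 2)]   # W_2..W_{N-1}: r x r
    weights.append(u @ root)                             # W_N: d_out x r
    return weights


def balanced_init(d_in, d_out, depth, std, rng):
    """Sample a Gaussian end-to-end matrix and factor it into balanced layers."""
    a = rng.normal(0.0, std, size=(d_out, d_in))
    return balanced_factorization(a, depth)


# Example: depth-5 network from 4 inputs to 3 outputs.
rng = np.random.default_rng(0)
layers = balanced_init(d_in=4, d_out=3, depth=5, std=0.1, rng=rng)
print([w.shape for w in layers])
```

Because the factors share the same singular values, adjacent layers satisfy $W_{j+1}^{\top}W_{j+1}=W_{j}W_j^\top$ up to floating-point error, which is the $\delta=0$ case of the balancedness measure in panel (c).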

Theorems & Definitions (54)

  • Definition 1
  • Definition 2
  • Claim 1
  • proof
  • Theorem 1
  • Lemma 1
  • proof : Proof of Lemma 1 (in idealized setting; for the complete proof see the appendix)
  • proof : Proof of Theorem 1
  • Theorem 2
  • proof
  • ...and 44 more