Overparameterization of deep ResNet: zero loss and mean-field analysis

Zhiyan Ding, Shi Chen, Qin Li, Stephen Wright

TL;DR

This work analyzes why gradient descent finds interpolating, zero-loss solutions in heavily overparameterized ResNets by casting training as a gradient flow over a probability distribution of layer parameters. Taking a continuous-depth limit and a mean-field width limit, the authors derive forward dynamics for $z(t;x)$ coupled with a Wasserstein gradient-flow PDE for the parameter density $\rho$, regularized by $E_s$. They prove that, under suitable assumptions, the solution of the limiting PDE converges to a global minimizer (zero loss), and they give conditions on depth and width under which a large but finite network reaches near-zero loss with high probability. The results provide a rigorous mechanism linking overparameterization to zero-loss interpolation and establish well-posedness and convergence of the mean-field and continuum formulations.

Abstract

Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, but a basic first-order optimization method (gradient descent) finds a global optimizer with perfect fit (zero-loss) in many practical situations. We examine this phenomenon for the case of Residual Neural Networks (ResNet) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of weights in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that the gradient descent for parameter training becomes a gradient flow for a probability distribution that is characterized by a partial differential equation (PDE) in the large-NN limit. Next, we show that under certain assumptions, the solution to the PDE converges in the training time to a zero-loss solution. Together, these results suggest that the training of the ResNet gives a near-zero loss if the ResNet is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
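
The finite network that the abstract sends to the infinite-depth, infinite-width limit can be sketched as follows. This is a minimal illustration, not the authors' code: the residual function `residual_f`, the parameter packing, the readout (first coordinate of the final state), and the $1/(LM)$ scaling of the residual update are all assumptions chosen so that each layer averages over its $M$ neurons and the $L$ layers act like Euler steps on the interval $[0,1]$.

```python
import numpy as np

def residual_f(z, theta):
    """Hypothetical smooth residual f(z, theta) = a * tanh(w . z + b).

    theta packs (a, w, b): a and w have shape (d,), b is a scalar,
    so theta has shape (2*d + 1,).
    """
    d = z.shape[0]
    a, w, b = theta[:d], theta[d:2 * d], theta[2 * d]
    return a * np.tanh(w @ z + b)

def resnet_forward(x, thetas):
    """Forward pass z_{l+1} = z_l + 1/(L*M) * sum_m f(z_l, theta_{l,m}).

    thetas has shape (L, M, 2*d + 1): L layers (depth), M neurons per layer (width).
    The 1/(L*M) scaling is an assumption of this sketch.
    """
    L, M, _ = thetas.shape
    z = np.array(x, dtype=float)
    for l in range(L):
        update = sum(residual_f(z, thetas[l, m]) for m in range(M))
        z = z + update / (L * M)
    return z

def loss(xs, ys, thetas):
    """Half mean squared error; the readout (first coordinate of z) is an assumption."""
    preds = np.array([resnet_forward(x, thetas)[0] for x in xs])
    return 0.5 * np.mean((preds - np.asarray(ys, dtype=float)) ** 2)
```

With all parameters set to zero the residual vanishes and the network is the identity map, which is a convenient sanity check before running gradient descent on `loss` with respect to `thetas`.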

Paper Structure

This paper contains 34 sections, 28 theorems, 254 equations.

Key Result

Theorem 6

Suppose that Assumption assum:f holds. Let $\rho_{\mathrm{ini}}(\theta,t)$ be limit-admissible, and let $\{\theta_m(0;t)\}_{m=1}^{M}$ in eqn:Wassgradientflowsdis be drawn i.i.d. from $\rho_{\mathrm{ini}}(\theta,t)$. Let [...]. Then for any positive $\epsilon$, $\eta$, and $S$, there exists a constant $C>0$ that depends on $\rho_{\mathrm{ini}}(\theta,t)$ and $S$ such that when [...] we have [...], where $E(\Theta_{L,M}$ [...]

Theorems & Definitions (42)

  • Remark 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Remark 5
  • Theorem 6
  • Theorem 7
  • Remark 8: Assumptions in Theorem \ref{thm:globalminimal}.
  • Theorem 9
  • Theorem 10
  • ...and 32 more