Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

Xiaoxia Wu, Simon S. Du, Rachel Ward

TL;DR

This work proposes an adaptive gradient method and shows that for two-layer over-parameterized neural networks -- if the width is sufficiently large (polynomially) -- then the proposed method converges to the global minimum in polynomial time, and convergence is robust.

Abstract

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet existing convergence guarantees for adaptive gradient methods require either convexity or smoothness and, in the smooth setting, only guarantee convergence to a stationary point. We propose an adaptive gradient method and show that, for two-layer over-parameterized neural networks of sufficiently large (polynomial) width, the proposed method converges to the global minimum in polynomial time. The convergence is robust: it does not require fine-tuning hyper-parameters such as the step-size schedule, and the required level of over-parametrization is independent of the training error. Our analysis indicates in particular that over-parametrization is crucial for harnessing the full potential of adaptive gradient methods in the setting of neural networks.
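To make the setting concrete, here is a minimal sketch under stated assumptions; it is not the authors' algorithm. It trains a two-layer ReLU network $f(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \max(w_r^\top x, 0)$ on the squared loss with an AdaGrad-Norm style step size, where a single scalar accumulates squared gradient norms. The network parameterization, the fixed random output signs `a`, the constants `eta` and `b0`, and every function name are illustrative choices; the paper's own update rule is not reproduced in this excerpt.

```python
import numpy as np

def init_network(m, d, rng):
    # Hidden weights are Gaussian; output signs are fixed and never trained (assumption).
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def loss_and_grad(W, a, X, y):
    # Squared loss L = 0.5 * sum_i (f(x_i) - y_i)^2 and its gradient w.r.t. W.
    m = W.shape[0]
    pre = X @ W.T                                        # (n, m) pre-activations
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y    # (n,) residuals
    loss = 0.5 * np.sum(resid ** 2)
    act = (pre > 0).astype(X.dtype)                      # ReLU derivative
    grad = ((act * resid[:, None]).T @ X) * a[:, None] / np.sqrt(m)  # (m, d)
    return loss, grad

def train_adagrad_norm(X, y, m=1000, eta=1.0, b0=1.0, steps=200, seed=0):
    # AdaGrad-Norm style update (assumed stand-in for the paper's method):
    #   b_{k+1}^2 = b_k^2 + ||grad_k||^2,  W_{k+1} = W_k - (eta / b_{k+1}) * grad_k
    rng = np.random.default_rng(seed)
    W, a = init_network(m, X.shape[1], rng)
    b_sq = b0 ** 2
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, a, X, y)
        losses.append(loss)
        b_sq += np.sum(grad ** 2)
        W -= eta / np.sqrt(b_sq) * grad
    return W, a, losses
```

Because the effective step size `eta / sqrt(b_sq)` shrinks automatically when gradients are large, the sketch needs no hand-tuned schedule, which is the kind of robustness the abstract refers to.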

Paper Structure

This paper contains 29 sections, 15 theorems, 85 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Under Assumption asmp:norm1 and asmp:lambda_0, if the number of hidden nodes $m = \Omega\left(\frac{n^6}{\lambda_0^4 \delta^{3}}\right)$ and we set the stepsize to be [expression missing in the source], then with probability at least $1-\delta$ over the random initialization, after [iteration count missing in the source] iterations, we have $L(\mathbf{W}(T)) \le \varepsilon$. Here $\widetilde{O}$ and $\widetilde{\Omega}$ hide $\log(n)$, $\log(1/\lambda_0)$, and $\log(1/\delta)$ terms.

Figures (1)

  • Figure 1: Top plots: the y-axis is the maximum or minimum eigenvalue of the matrix $\mathbf{H}(k)$; the x-axis is the iteration. Bottom plots (left and middle): the y-axis is the probability; the x-axis is the eigenvalue of the covariance matrix induced by Gaussian data. Bottom plot (right): the y-axis is the training error on a logarithmic scale; the x-axis is the iteration. The eigenvalue distributions of the $d \times d$ covariance matrix of the data are plotted on the left for i.i.d. Gaussian data and in the middle for correlated Gaussian data. The bottom right plot shows the training error of the two-layer neural network with $m=5000$ on the two Gaussian datasets.
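
The quantities in the top panels can be sketched as follows. This excerpt does not define $\mathbf{H}(k)$; the code assumes it is the Gram matrix commonly used in analyses of over-parameterized two-layer ReLU networks, $\mathbf{H}_{ij} = \frac{1}{m}\, x_i^\top x_j \sum_{r=1}^{m} \mathbb{1}\{w_r^\top x_i \ge 0,\ w_r^\top x_j \ge 0\}$, evaluated at the weights of iteration $k$. The function names are hypothetical.

```python
import numpy as np

def gram_matrix(W, X):
    # Assumed definition: H_ij = (1/m) * <x_i, x_j> * #{r : w_r.x_i >= 0 and w_r.x_j >= 0}
    act = (X @ W.T >= 0).astype(X.dtype)      # (n, m) activation patterns
    return (X @ X.T) * (act @ act.T) / W.shape[0]

def extreme_eigenvalues(H):
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eig = np.linalg.eigvalsh(H)
    return eig[0], eig[-1]
```

Calling `extreme_eigenvalues(gram_matrix(W, X))` once per training iteration would give the two curves such a plot tracks; a minimum eigenvalue that stays bounded away from zero is what a condition like the one on $\lambda_0$ in Theorem 3.1 is about.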

Theorems & Definitions (19)

  • Definition 2.1
  • Theorem 3.1: Convergence Rate of Gradient Descent with Improved Learning Rate
  • Theorem 4.1: Convergence Rate of AdaLoss
  • Remark 4.1
  • Lemma 4.1
  • Lemma 4.2
  • Proposition 5.1
  • Proposition 5.2
  • Lemma B.1
  • Lemma B.2
  • ...and 9 more