Provable Accelerated Convergence of Nesterov's Momentum for Deep ReLU Neural Networks

Fangshuo Liao; Anastasios Kyrillidis

TL;DR

This work considers a new class of objective functions in which only a subset of the parameters satisfies strong convexity, and shows that Nesterov's momentum provably achieves acceleration on this class. Deep ReLU networks are one realization of the class.

Abstract

Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov's momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, where only a subset of the parameters satisfies strong convexity, and show that Nesterov's momentum achieves acceleration in theory for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks. To the best of our knowledge, this makes our work the first to prove an accelerated convergence rate for non-trivial neural network architectures.

Paper Structure (38 sections, 21 theorems, 271 equations, 1 figure)

Introduction
Related Works
Problem Setup and Assumptions
Accelerated Convergence under Partial Strong Convexity
Warmup: Convergence of Gradient Descent
Acceleration of Nesterov's Momentum
Technical Difficulty
Proof Overview
Realization of Assumption \ref{asump:strong_cvx}-\ref{asump:universal_opt}
Additive model
Deep ReLU Neural Networks
Conclusion and Broader Impact
Proofs for Section \ref{sec:preliminary}
Proof of Theorem \ref{theo:strong_cvx_smooth_suffice}
Proofs for Section \ref{sec:gd_conv}
...and 23 more sections

Key Result

Theorem 1

Let $f\left({\mathbf{x}},{\mathbf{u}}\right):\mathbb{R}^{d_1}\times \mathbb{R}^{d_2}\rightarrow\mathbb{R}$ be $L_1$-smooth and $\mu$-strongly convex with respect to ${\mathbf{x}}$ for all ${\mathbf{u}}\in\mathbb{R}^{d_2}$, and let $\kappa = L_1/\mu$. If $f\left({\mathbf{x}},{\mathbf{u}}\right)$ also

Figures (1)

Figure 1: Experiment of learning additive model with gradient descent and Nesterov's momentum.
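
The kind of comparison named in this caption can be illustrated on a toy problem. The sketch below is not the paper's additive-model experiment: it assumes a plain strongly convex quadratic, and the helper names (`make_quadratic`, `gradient_descent`, `nesterov`), the dimension, and the constants `mu = 1`, `L = 100` are all hypothetical choices made for illustration. It uses the textbook constant-momentum form of Nesterov's method for a $\mu$-strongly convex, $L$-smooth objective, with step size $1/L$ and momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$, where $\kappa = L/\mu$.

```python
import numpy as np


def make_quadratic(dim=50, mu=1.0, L=100.0, seed=0):
    """Toy objective f(x) = 0.5 (x - x*)^T A (x - x*), eigenvalues of A in [mu, L]."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal basis
    eigs = np.linspace(mu, L, dim)
    A = (Q * eigs) @ Q.T       # same as Q @ diag(eigs) @ Q.T
    x_star = Q @ np.ones(dim)  # minimizer with equal weight on every eigen-direction
    return A, x_star


def loss(A, x_star, x):
    d = x - x_star
    return 0.5 * d @ A @ d


def grad(A, x_star, x):
    return A @ (x - x_star)


def gradient_descent(A, x_star, x0, L, steps):
    """Plain gradient descent with step size 1/L."""
    x = x0.copy()
    for _ in range(steps):
        x = x - grad(A, x_star, x) / L
    return x


def nesterov(A, x_star, x0, mu, L, steps):
    """Nesterov's momentum, constant-momentum variant for strongly convex f."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    x = x0.copy()
    y = x0.copy()
    for _ in range(steps):
        x_next = y - grad(A, x_star, y) / L  # gradient step at the look-ahead point
        y = x_next + beta * (x_next - x)     # extrapolate along the last displacement
        x = x_next
    return x


if __name__ == "__main__":
    mu, L, steps = 1.0, 100.0, 300
    A, x_star = make_quadratic(mu=mu, L=L)
    x0 = np.zeros(A.shape[0])
    print("initial loss :", loss(A, x_star, x0))
    print("GD loss      :", loss(A, x_star, gradient_descent(A, x_star, x0, L, steps)))
    print("Nesterov loss:", loss(A, x_star, nesterov(A, x_star, x0, mu, L, steps)))
```

On this toy quadratic, classical theory gives gradient descent a per-step contraction governed by $1 - 1/\kappa$ and Nesterov's momentum one governed by $1 - 1/\sqrt{\kappa}$, so the momentum iterate should reach a much smaller loss in the same number of steps. The paper's contribution concerns the harder setting where only part of the parameters satisfies strong convexity, which this sketch does not model.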

Theorems & Definitions (23)

Theorem 1: Informal statement of Theorem \ref{theo:nesterov_conv}
Theorem 2: Informal statement of Theorem \ref{theo:nn_nesterov_conv}
Definition 1
Definition 2
Theorem 3
Lemma 4
Lemma 5
Theorem 6
Theorem 7
Lemma 8
...and 13 more
