Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis

Semih Cayci; Atilla Eryilmaz

TL;DR

An in-depth nonasymptotic analysis of recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, proves that gradient descent can achieve optimality without massive overparameterization.

Abstract

We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality \emph{without} massive overparameterization. Our in-depth nonasymptotic analysis (i) provides improved bounds on the network size $m$ in terms of the sequence length $T$, sample size $n$ and ambient dimension $d$, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the convergence and network width bounds characterized by a cutoff point that depends on the Lipschitz continuity of the activation function. Remarkably, this analysis reveals that an appropriately-initialized recurrent neural network trained with $n$ samples can achieve optimality with a network size $m$ that scales only logarithmically with $n$. This sharply contrasts with the prior works that require high-order polynomial dependency of $m$ on $n$ to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and establishing local smoothness properties of the hidden state with respect to the learnable parameters.

Paper Structure (19 sections, 16 theorems, 116 equations, 1 figure)

Introduction
Our Contributions
Notation
Learning Dynamical Systems with Empirical Risk Minimization
Empirical Risk Minimization for Dynamical Systems
Elman-Type Recurrent Neural Networks
Gradient Descent for Recurrent Neural Networks
Main Results: An Overview
Infinite-Width Limit of Recurrent Neural Networks
Neural Tangent Kernel for Elman-Type Recurrent Neural Networks
Infinite-Width Limit of RNNs
Approximating $\mathscr{F}_{\bar{\nu}}$ by Randomly-Initialized RNNs
Convergence of GD for RNNs: Rates and Analysis
Local Lipschitz Continuity and Smoothness of the Hidden State
Convergence of Projected Gradient Descent for RNNs
...and 4 more sections

Key Result

Theorem 1

For large enough $\bm{\rho} \succ 0$, for any $\delta\in(0,1)$, $\tau\geq 1$ iterations of projected-gradient descent with the step-size $\eta=\frac{1}{T\sqrt{\tau}}$ yields with probability at least $1-\delta$ over the random initialization for $\bm{F}^\star\in\mathscr{F}_{\bm{\bar{\nu}}}$ with some $\alpha \in (0,1)$, where $\mu_T = \mathcal{O}(1)$ if $\alpha_m=\alpha+\frac{\rho_w}{\sqrt{m}}<\f
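The theorem statement above is cut off, but the procedure it names (projected gradient descent run for $\tau$ iterations with step size $\eta=\frac{1}{T\sqrt{\tau}}$) can be illustrated on its own. The following is a minimal sketch, not the authors' algorithm: the coordinate-wise box projection around the initial parameters, the `radius` argument standing in for the role of $\bm{\rho}$, and the names `project`, `projected_gd` and `grad` are all assumptions made for illustration.

```python
import math


def project(theta, theta0, radius):
    """Clip each coordinate to within `radius` of its initial value.

    Assumption: a coordinate-wise box around the initialization; the
    excerpt does not specify the projection set used in the paper.
    """
    return [min(max(t, t0 - radius), t0 + radius)
            for t, t0 in zip(theta, theta0)]


def projected_gd(grad, theta0, radius, T, tau):
    """Run tau projected gradient steps with eta = 1 / (T * sqrt(tau)).

    grad:   hypothetical callable returning the gradient at theta
    theta0: initial parameters (also the center of the projection box)
    T:      sequence length, used only through the step size
    """
    eta = 1.0 / (T * math.sqrt(tau))
    theta = list(theta0)
    for _ in range(tau):
        g = grad(theta)
        stepped = [t - eta * gi for t, gi in zip(theta, g)]
        theta = project(stepped, theta0, radius)
    return theta


# Toy objective 0.5 * (theta - 10)^2, whose gradient is theta - 10.
toy_grad = lambda theta: [t - 10.0 for t in theta]
constrained = projected_gd(toy_grad, [0.0], radius=0.5, T=1, tau=4)
unconstrained = projected_gd(toy_grad, [0.0], radius=100.0, T=1, tau=1)
```

In the toy run, the small radius keeps the iterate pinned at the boundary of the box, while the large radius lets a single step reach the minimizer. The step size shrinks with both the sequence length `T` and the iteration budget `tau`, as in the theorem.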

Figures (1)

Figure 1: Unfolded representation of an Elman-type recurrent neural network in the matrix notation $Y_t = c^\top\vec{\sigma}(\mathbf{W} H_{t-1}+\mathbf{U} X_{t-1})$, where $\vec{\sigma}:\mathbb{R}^m\mapsto\mathbb{R}^m$ applies $\sigma$ pointwise to each component of its input.
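The recursion in the caption can be unrolled directly. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it takes $\mathbf{W}$ to be diagonal (as in the abstract), assumes the initial hidden state is zero, omits any width-dependent output scaling, and uses the hypothetical names `elman_forward`, `w_diag`, `U`, `c` and `xs`.

```python
import math


def elman_forward(w_diag, U, c, xs, sigma=math.tanh):
    """Unroll Y_t = c^T sigma(W H_{t-1} + U X_{t-1}) with W = diag(w_diag).

    w_diag: m diagonal recurrent weights
    U:      m rows, each holding d input weights
    c:      m output weights
    xs:     T input vectors X_0, ..., X_{T-1}, each of length d
    Returns the list of outputs [Y_1, ..., Y_T].
    """
    m = len(w_diag)
    h = [0.0] * m  # assumption: initial hidden state H_0 = 0
    ys = []
    for x in xs:
        # With a diagonal W, unit i only sees its own previous state h[i].
        h = [sigma(w_diag[i] * h[i] + sum(u * xj for u, xj in zip(U[i], x)))
             for i in range(m)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# One linear unit: the hidden state follows h_t = 0.5 * h_{t-1} + x_{t-1}.
linear_out = elman_forward([0.5], [[1.0]], [2.0], [[1.0], [1.0]],
                           sigma=lambda z: z)

# Two identical units with opposite output weights cancel at every step.
paired_out = elman_forward([0.5, 0.5], [[1.0], [1.0]], [1.0, -1.0],
                           [[1.0], [2.0], [3.0]])
```

Because $\mathbf{W}$ is diagonal, each hidden unit evolves as an independent scalar recursion driven by the shared input, so the update costs $O(md)$ per time step instead of $O(m^2 + md)$. The second call shows that pairing identical units with opposite output weights makes the network output exactly zero for every input sequence.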

Theorems & Definitions (38)

Definition 2.1: Symmetric random initialization
Theorem \ref{thm:proj-gd} (Informal): Convergence of projected gradient descent for RNNs
Corollary \ref{cor:avg-iterate-1} (Informal): Average-iterate convergence of projected gradient descent
Corollary \ref{cor:sgd-avg-iterate-1} (Informal): Convergence of projected stochastic gradient descent
Theorem \ref{thm:gd} (Informal): Convergence of gradient descent for RNNs
Remark 2.2
Proposition 3.1
Remark 3.2
Proof 1: Proof of Proposition \ref{prop:gradient}
Proposition 3.3: NTK for Diagonal RNNs
...and 28 more
