Kernel Limit for a Class of Recurrent Neural Networks Trained on Ergodic Data Sequences

Samuel Chun-Hei Lam; Justin Sirignano; Konstantinos Spiliopoulos

TL;DR

A fixed point analysis is developed for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units.

Abstract

Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude $\mathcal{O}(1/N)$ and the number of updates is $\mathcal{O}(N)$. Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as $N \rightarrow \infty$. However, the RNN hidden layer updates are $\mathcal{O}(1)$. Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods give rise to the neural tangent kernel (NTK) limits for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.
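The contrast the abstract draws between $\mathcal{O}(1/N)$ updates and $\mathcal{O}(1)$ updates can be illustrated with a minimal sketch. The two scalar maps below are toy assumptions chosen for illustration only; they are not the paper's RNN, its data sequence, or its limit equations. The first function takes about $N$ steps of size $1/N$ and so behaves like an Euler scheme for an ODE. The second applies an $\mathcal{O}(1)$ contraction repeatedly, which has no ODE limit but settles at a fixed point.

```python
import math


def euler_mean_field(N, T=1.0):
    """Toy O(1/N) updates: int(N*T) Euler steps for dx/dt = -x with x(0) = 1.

    As N grows, the result approaches the ODE solution exp(-T).
    """
    x = 1.0
    for _ in range(int(N * T)):
        x += (1.0 / N) * (-x)  # each update has magnitude O(1/N)
    return x


def contraction_fixed_point(steps, a=0.5, b=1.0, s0=0.0):
    """Toy O(1) updates: iterate s <- tanh(a*s + b).

    Hypothetical scalar stand-in for a memory-state update. Since
    |d/ds tanh(a*s + b)| <= |a| < 1, the map is a contraction and the
    iterates converge geometrically to its unique fixed point.
    """
    s = s0
    for _ in range(steps):
        s = math.tanh(a * s + b)  # each update changes s by O(1)
    return s


if __name__ == "__main__":
    # Euler regime: many small steps track the ODE solution exp(-1).
    print(euler_mean_field(10_000), math.exp(-1.0))
    # Fixed point regime: the iterates stop moving after enough steps.
    s = contraction_fixed_point(100)
    print(s, math.tanh(0.5 * s + 1.0))
```

In the first regime the error is controlled by the step size $1/N$; in the second it is controlled by the number of update steps through the contraction rate, which mirrors the kind of estimate the abstract describes in terms of the number of update steps.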

Paper Structure

This paper contains 28 sections, 25 theorems, 251 equations, 5 figures, and 1 algorithm.

Introduction
Assumptions, Data, and Model Architecture
Data generation
Recurrent Neural Network
Training the RNN parameters
Assumptions for the wide-network limit
Clipping the neural network output
Main Results
Dynamics of the RNN Hidden/Memory Layers
Dynamics of the RNN Outputs
Limit RNN minimises the average loss
Proof of Dynamics of RNN Memory Layer
Reduction to Initialisations
Weak Law of Large Numbers
Dynamics of the memory units at the asymptotic limit
...and 13 more sections

Key Result

Lemma 2.12

Fix $T>0$. If we choose $\gamma \in (0, (1-\beta)/2)$ as in the assumption on the choice of $\gamma$, then for all $k$ with $k/N \leq T$, there exists $C_T > 0$ such that

Figures (5)

Figure 1: Each curve represents the overall empirical distribution of the untrained hidden memory units, pooled over all simulation instances $\ell = 1, \ldots, 100$, for $N = 10^2, \ldots, 10^6$ and time step $k \approx 50000$.
Figure 2: The plots of the time-averaged first and second moments of the hidden units for a sufficiently large $N$ (chosen to be $10^6$) and $p = 1,2$. The $x$-axis represents the number of time steps. We summarise the minimum/maximum of the simulated first and second moments of the time-averages for independent input sequences $X$ using a (seemingly invisible) grey band. The red line represents the mean of the time-averaged moments for all input sequences $X$, thus providing a Monte-Carlo estimate for the moments of the random fixed point. The fact that the realisations of the time averages all converge as $k \to \infty$ illustrates the ergodicity of the sequence $S^{i,N}_k(X;\theta)$.
Figure 3: $\rho_N(x) = g_{-2N^\gamma,-N^\gamma}(-x) \times g_{-2N^\gamma,-N^\gamma}(x)$
Figure 4: Empirical distributions of the untrained hidden memory units $\varsigma_{X_k,u^N_k}(W^i)$ for varying $N$ and large time step $k \approx 50000$. The grey lines represent the empirical distribution for a single set of the untrained hidden memory units $\nu^{N,\mathsf{path}}_k$, and the red line represents the empirical distribution of all untrained hidden memory units from all sets $\nu^{N,\mathsf{overall}}_k$.
Figure 5: The plot of time averages for $N = 10^k, k = 2,3,4,5,6$ and $p = 1,2$. The actual realizations of $\mathsf{timeAvg}^{N,p,\mathsf{path}}_T$ lie in the grey band, and the red line is the overall time average $\mathsf{timeAvg}^{N,p,\mathsf{overall}}_T$ as the Monte-Carlo estimate of $\mathbb{E}[\mathsf{timeAvg}^{N,p}_T]$. The desired converging behaviour is only exhibited when $N$ is sufficiently large.

Theorems & Definitions (59)

Definition 2.1: Wasserstein Metric
Remark 2.2
Example 2.4
Example 2.5
Example 2.9
Definition 2.10: Smooth clipping function (Cohen, Jiang, Sirignano)
Lemma 2.12
proof
Remark 2.13
Lemma 3.1: Dynamics of RNN Memory Layer
...and 49 more
