Memory of recurrent networks: Do we compute it right?

Giovanni Ballarin; Lyudmila Grigoryeva; Juan-Pablo Ortega

TL;DR

The authors prove that ignoring the Krylov structure of the linear MC introduces a gap between the theoretical MC and its empirical counterpart. They then develop robust numerical approaches by exploiting the neutrality of the MC with respect to the input mask matrix.

Abstract

Numerical evaluations of the memory capacity (MC) of recurrent neural networks reported in the literature often contradict well-established theoretical bounds. In this paper, we study the case of linear echo state networks, for which the total memory capacity has been proven to be equal to the rank of the corresponding Kalman controllability matrix. We shed light on various reasons for the inaccurate numerical estimations of the memory, and we show that these issues, often overlooked in the recent literature, are of an exclusively numerical nature. More explicitly, we prove that when the Krylov structure of the linear MC is ignored, a gap between the theoretical MC and its empirical counterpart is introduced. As a solution, we develop robust numerical approaches by exploiting a result of MC neutrality with respect to the input mask matrix. Simulations show that the memory curves that are recovered using the proposed methods fully agree with the theory.
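
The gap described in the abstract can be reproduced on a toy example. The following is a minimal sketch in Python with NumPy, not the authors' code. It builds the Kalman controllability matrix $R(A, \mathbf{C}) = (\mathbf{C} \mid A\mathbf{C} \mid \cdots \mid A^{N-1}\mathbf{C})$ for an assumed diagonal reservoir with $N = 60$ distinct real eigenvalues and a constant unit-norm input mask, and compares the theoretical rank with the rank that NumPy reports in double precision. The reservoir, the mask, the value of $N$, and the helper name `controllability_matrix` are illustrative choices that do not come from the paper.

```python
import numpy as np

def controllability_matrix(A, C):
    """Stack C, AC, ..., A^(N-1) C as columns (hypothetical helper)."""
    N = A.shape[0]
    cols = [C]
    for _ in range(N - 1):
        cols.append(A @ cols[-1])
    return np.column_stack(cols)

N = 60
# Assumed toy reservoir: diagonal A with N distinct real eigenvalues, rho(A) = 0.9.
eigenvalues = np.linspace(-0.9, 0.9, N)
A = np.diag(eigenvalues)
# Assumed input mask: constant entries, normalized to unit norm.
C = np.ones(N) / np.sqrt(N)

R = controllability_matrix(A, C)
# SVD-based rank with NumPy's default tolerance.
numerical_rank = np.linalg.matrix_rank(R)
singular_values = np.linalg.svd(R, compute_uv=False)

print("theoretical rank (distinct eigenvalues, nonzero mask):", N)
print("numerical rank in double precision:", numerical_rank)
print("smallest / largest singular value:", singular_values[-1] / singular_values[0])
```

In exact arithmetic this matrix is a row-scaled Vandermonde matrix with distinct nodes, so its rank is $N$. The numerical rank falls short only because the Krylov columns become nearly linearly dependent in floating point, which is the kind of purely numerical effect the abstract refers to.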

Paper Structure

This paper contains 24 sections, 7 theorems, 69 equations, 6 figures, 1 algorithm.

Introduction
Code
Notation
Linear Memory Capacity
Memory Capacity
Fisher Memory
Relation Between Memory Capacities and Fisher Memory
Linear Models Generically Have Maximal Memory
Monte Carlo Estimation of Memory Capacities
Naïve Algebraic Memory Estimation
Robust Memory Computation
Input Mask Memory Neutrality
Another Formula for Memory Capacity
Krylov Conditioning
Memory Gaps and Krylov Subspace Squeezing
...and 9 more sections

Key Result

Proposition 2.2

Consider the linear ESN model in \eqref{eq:ESN_def_1}-\eqref{eq:ESN_def_2} and let $\boldsymbol{\zeta} = \mathbf{0}$. Let $A$ be diagonalizable and such that $\rho(A) < 1$, with $\rho(A)$ the spectral radius of the matrix $A$. Suppose that all the eigenvalues of $A$ are distinct. Let any of the following equivalent conditions hold: [...] If $({z}_t)_{t\in\mathbb{Z}_-}$ is a weakly stationary white noise process, then $\text{MC} = N$.
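
A small numerical check can make the statement $\text{MC} = N$ concrete. The sketch below is an illustration under stated assumptions, not the paper's algorithm. It assumes the state equation $\mathbf{x}_t = A\mathbf{x}_{t-1} + \mathbf{C} z_t$ with unit-variance white-noise input, for which the lag-$\tau$ capacity can be written as $\text{MC}_\tau = \mathbf{C}^\top (A^\tau)^\top G_{\mathbf{x}}^{-1} A^\tau \mathbf{C}$ with $G_{\mathbf{x}} = \sum_{j \geq 0} A^j \mathbf{C}\mathbf{C}^\top (A^j)^\top$. The series is truncated at `n_terms`, and the $4 \times 4$ diagonal reservoir, the mask, and the function name `memory_curve` are hypothetical choices.

```python
import numpy as np

def memory_curve(A, C, max_lag, n_terms=1000):
    """Return MC_tau for tau = 0..max_lag (unit-variance white-noise input assumed)."""
    N = A.shape[0]
    # Gram matrix G_x as a truncated series sum_j (A^j C)(A^j C)^T.
    G = np.zeros((N, N))
    v = C.copy()
    for _ in range(n_terms):
        G += np.outer(v, v)
        v = A @ v
    # MC_tau = (A^tau C)^T G_x^{-1} (A^tau C), evaluated lag by lag.
    curve = np.empty(max_lag + 1)
    v = C.copy()
    for tau in range(max_lag + 1):
        curve[tau] = v @ np.linalg.solve(G, v)
        v = A @ v
    return curve

# Assumed toy model: N = 4, distinct eigenvalues, spectral radius 0.9.
A = np.diag([0.9, 0.5, -0.3, -0.7])
C = np.ones(4) / 2.0          # unit-norm mask with no zero entries

curve = memory_curve(A, C, max_lag=999)
total_mc = curve.sum()
print("sum of MC_tau over the computed lags:", total_mc)
```

For this well-conditioned toy case the sum should land very close to $N = 4$. The same direct inversion of $G_{\mathbf{x}}$ is not expected to stay reliable for large $N$, where the matrix becomes numerically singular.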

Figures (6)

Figure 1: Illustration of memory capacity inflation due to the inconsistent estimation of $\textnormal{MC}_\tau$ for LESN with $N = 100$, orthogonal $A$ with $\rho(A) = 0.9$, and input mask $\mathbf{C}= \overline{\mathbf{C}} / \lVert\overline{\mathbf{C}}\rVert$ with $\overline{\mathbf{C}} = (\overline{c}_{i})_{i=1}^N \sim\ \text{i.i.d.}\ \mathcal{N}(0,1)$: (a) memory curves $\widehat{\textnormal{MC}}_\tau(T)$; (b) bar chart of normalized total memory capacity $\widehat{\textnormal{MC}}(T) / N$. Memory curves $\widehat{\textnormal{MC}}_\tau(T)$ are computed for $\tau \in \{0, 1, \ldots, 5N\}$ (in (a), $\widehat{\textnormal{MC}}_\tau(T)$ is plotted only up to $\tau=2N$ for the sake of clarity). Estimators are computed from simulated $(z_t)_{t=1}^T \sim \text{i.i.d.}\ \mathcal{N}(0,1)$, with $T\in \{1000, 1500,\ldots, 10000\}$.
Figure 2: Eigenvalue plot (in absolute values) for ${G}_{\mathbf{x}}$ for various types of connectivity matrices. ${G}_{\mathbf{x}}$ was computed using $1000$ series terms in \eqref{eq:MC_tau_AC}, a connectivity matrix $A \in \mathbb{M}_N$ with spectral radius $\rho(A) = 0.9$ and a unit norm input mask $\mathbf{C} \in \mathbb{R}^N$. Computations are performed in MATLAB; the standard double-precision machine epsilon $\texttt{eps} = 2^{-52}\approx 2.22\times 10^{-16}$ is marked with the black horizontal solid line.
Figure 3: Krylov subspace squeezing effects as measured using the norm of the orthogonal component for reservoir matrix $A = (A_{ij}) \in \mathbb{M}_N$, $\rho(A) = 0.9$, sampled $\mathcal{N}(0,1)$ in (a), $\mathcal{U}(-1,1)$ in (b), sparse standard Gaussian with the degree of sparsity $0.1$, $sp\mathcal{N}(0,1,0.1)$, in (c), and orthogonal standard Gaussian in (d), and for Krylov matrix $K_m \in \mathbb{M}_{N, m}$, where in all plots $N = 100$ and $m = 5N$. Input mask is $\mathbf{C} = \boldsymbol{\iota}_N=(1, \ldots, 1)^\top \in \mathbb{R}^N$. The black dotted line shows the exponential decay of the leading eigenvalue $\rho(A)$, while the black dashed line illustrates the approximate decay law derived using random matrix theory in Section \ref{subsection:rmt_insights}. A solid black horizontal line shows the double-precision machine epsilon in MATLAB, $\texttt{eps} = 2^{-52}\approx 2.22\times 10^{-16}$.
Figure 4: Memory capacity curves of LESNs with connectivity matrix $A = (A_{ij}) \in \mathbb{M}_N$ with $\rho(A) = 0.9$. In all panels $A_{ij}$ are sampled as i.i.d. degree $0.1$ sparse standard normal, $sp\mathcal{N}(0,1,0.1)$, and the input mask $\mathbf{C} = (c_{i}) \in \mathbb{R}^N$ is sampled as $\mathcal{N}(0,1)$ in (a), $\mathcal{U}(-1,1)$ in (b), degree $0.1$ sparse Gaussian, $sp\mathcal{N}(0,1,0.1)$, in (c), and degree $0.1$ sparse uniform, $sp\,\mathcal{U}(0,1,0.1)$, in (d). $\mathbf{C}$ is normalized after sampling to have a unit norm. Total MC is estimated as the sum of $\text{MC}_\tau$'s up to $1.5 N$ terms. For OSM+ the input mask $\mathbf{C}$ is resampled $L = 1000$ times to compute the average memory curve (lines) and $90\%$ frequency bands for $\text{MC}_\tau$ (shaded).
Figure 5: Memory capacity curves of LESNs with input mask $\mathbf{C} = (c_{i}) \in \mathbb{R}^N$ and connectivity matrix $A = (A_{ij}) \in \mathbb{M}_N$, $\rho(A) = 0.9$, sampled from different standard distributions (in panel (d) $sp_C\mathcal{N}(0,1,0.1,0.7)$ stands for sparse standard Gaussian with sparsity degree $0.1$ and condition number $0.7$). In all panels $c_{i} \sim\ \text{i.i.d.}\ sp\mathcal{N}(0,1,0.1)$. $\mathbf{C}$ is normalized after sampling to have a unit norm. Total MC is computed as the sum of $\text{MC}_\tau$'s up to $1.5N$ terms. For OSM+ the input mask $\mathbf{C}$ is resampled $L = 1000$ times to compute the average memory curve (lines) and $90\%$ frequency bands for $\text{MC}_\tau$ (shaded).
...and 1 more figure

Theorems & Definitions (15)

Example 2.1: Delay reservoir
Proposition 2.2: LESN Memory Capacity
proof
Proposition 2.3: RC21
Proposition 2.4: Standardization of state-space realizations, RC15
Definition 2.5
Proposition 2.6
proof
Example 2.7: Cyclic reservoirs
Proposition 3.1: Input mask neutrality
...and 5 more
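
The input mask neutrality listed above as Proposition 3.1 can be illustrated numerically on a small case. The sketch below is hedged: the proposition's exact statement is not reproduced on this page, so the code assumes that neutrality means the memory curve $\text{MC}_\tau$ does not depend on the input mask $\mathbf{C}$ when $A$ has distinct eigenvalues and $\mathbf{C}$ has nonzero components in the eigenbasis of $A$. It also assumes the state equation $\mathbf{x}_t = A\mathbf{x}_{t-1} + \mathbf{C} z_t$ with unit-variance white-noise input. The $3 \times 3$ matrix, both masks, and the function name `memory_curve` are hypothetical.

```python
import numpy as np

def memory_curve(A, C, max_lag, n_terms=2000):
    """MC_tau for tau = 0..max_lag (unit-variance white-noise input assumed)."""
    N = A.shape[0]
    # Truncated Gram matrix sum_j (A^j C)(A^j C)^T.
    G = np.zeros((N, N))
    v = np.array(C, dtype=float)
    for _ in range(n_terms):
        G += np.outer(v, v)
        v = A @ v
    G_inv = np.linalg.inv(G)   # acceptable here because N = 3 and G is well conditioned
    curve = []
    v = np.array(C, dtype=float)
    for _ in range(max_lag + 1):
        curve.append(v @ G_inv @ v)
        v = A @ v
    return np.array(curve)

# Assumed toy reservoir: upper triangular, distinct eigenvalues 0.5, -0.4, 0.8.
A = np.array([[0.5, 0.2, 0.0],
              [0.0, -0.4, 0.3],
              [0.0, 0.0, 0.8]])
# Two different masks, both with nonzero components in the eigenbasis of A.
mask_a = np.array([1.0, 1.0, 1.0])
mask_b = np.array([0.3, -1.2, 2.0])

curve_a = memory_curve(A, mask_a, max_lag=30)
curve_b = memory_curve(A, mask_b, max_lag=30)
max_difference = np.max(np.abs(curve_a - curve_b))
print("largest gap between the two memory curves:", max_difference)
```

Under these assumptions the two curves should agree up to rounding error, which suggests why a numerically convenient mask can be substituted when computing the memory curve.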
