Fading memory as inductive bias in residual recurrent networks

Igor Dubinin; Felix Effenberger

TL;DR

The paper investigates how architectural inductive biases from residual connections shape fading memory and learning dynamics in recurrent networks. It introduces weakly coupled residual recurrent networks (WCRNNs), which possess well-defined Lyapunov exponents and therefore allow explicit control over memory timescales. For linear residuals, the network-wide Lyapunov spectrum approximates the log-eigenvalues of the residual matrix, $LE_{net} \approx \log\lambda_{residual}$, which gives rise to subcritical, critical, and supercritical regimes. The authors show that operating near the edge of chaos offers the best trade-off between learning efficiency and stability, and that different residual forms (rotational, heterogeneous, and non-linear) provide dataset-informed inductive biases that enhance practical expressivity on benchmarks such as sMNIST, psMNIST, ADD, and sCIFAR10. They also propose a weakly coupled residual initialization for Elman RNNs. These results suggest design principles for memory-focused RNNs that improve performance on long-range sequence tasks without constraining weight matrices. The work has potential implications for both artificial systems and neuroscience-inspired models, where operating near criticality with controlled memory is advantageous for learning and generalization.

Abstract

Residual connections have been proposed as an architecture-based inductive bias to mitigate the problem of exploding and vanishing gradients and to increase task performance in both feed-forward and recurrent networks (RNNs) when trained with the backpropagation algorithm. Yet, little is known about how residual connections in RNNs influence their dynamics and fading memory properties. Here, we introduce weakly coupled residual recurrent networks (WCRNNs) in which residual connections result in well-defined Lyapunov exponents and allow for studying properties of fading memory. We investigate how the residual connections of WCRNNs influence their performance, network dynamics, and memory properties on a set of benchmark tasks. We show that several distinct forms of residual connections yield effective inductive biases that result in increased network expressivity. In particular, those are residual connections that (i) result in network dynamics in the proximity of the edge of chaos, (ii) allow networks to capitalize on characteristic spectral properties of the data, and (iii) result in heterogeneous memory properties. In addition, we demonstrate how our results can be extended to non-linear residuals and introduce a weakly coupled residual initialization scheme that can be used for Elman RNNs.
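
To make the link between residual strength and Lyapunov exponents concrete, the following is a minimal sketch, not the authors' implementation. It assumes an update of the form $x_{t+1} = r\,x_t + \gamma \tanh(W_{xx} x_t + W_{sx} s_t)$, that is, a linear residual $R = rI$ plus a weakly coupled Elman-style term. This update form, the tanh non-linearity, the weight scales, the random input, and the function name `lyapunov_spectrum` are all assumptions chosen for illustration. The exponents are estimated with the standard QR re-orthonormalization of the Jacobian product.

```python
import numpy as np

def lyapunov_spectrum(r, gamma, n=16, steps=500, seed=0):
    """Estimate Lyapunov exponents of the ASSUMED map
    x_{t+1} = r * x_t + gamma * tanh(W_xx @ x_t + W_sx @ s_t)."""
    rng = np.random.default_rng(seed)
    # Hypothetical weight scales, not taken from the paper.
    W_xx = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    W_sx = rng.normal(0.0, 1.0, size=(n, 1))
    x = rng.normal(0.0, 0.1, size=n)

    Q = np.eye(n)            # orthonormal basis of perturbation directions
    log_sums = np.zeros(n)   # accumulated log stretching factors

    for _ in range(steps):
        s = rng.uniform(0.0, 1.0, size=1)   # random scalar stimulus
        pre = W_xx @ x + W_sx @ s
        # Jacobian of the assumed map at x_t: r*I + gamma * diag(tanh'(pre)) @ W_xx
        J = r * np.eye(n) + gamma * (1.0 - np.tanh(pre) ** 2)[:, None] * W_xx
        x = r * x + gamma * np.tanh(pre)
        # Re-orthonormalize; |diag(R)| holds the per-direction growth factors.
        Q, R_upper = np.linalg.qr(J @ Q)
        log_sums += np.log(np.abs(np.diag(R_upper)))

    return np.sort(log_sums / steps)[::-1]

if __name__ == "__main__":
    for r in (0.95, 0.995, 1.0, 1.0025, 1.01):
        les = lyapunov_spectrum(r, gamma=0.01)
        print(f"r={r}: log r={np.log(r):+.4f}, "
              f"LE range=[{les.min():+.4f}, {les.max():+.4f}]")
```

Under these assumptions and with small $\gamma$, the estimated exponents should cluster around $\log r$, which is the relation stated in the TL;DR for linear residuals.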

Paper Structure (14 sections, 22 equations, 10 figures, 2 tables, 1 algorithm)

Introduction
Background and motivation
Dynamical systems analysis
Learning dynamics
Weakly coupled residual recurrent networks
Experiments
Critical residuals
Rotational residuals
Heterogeneous residuals
Non-linear residuals
Discussion
Conclusion
Algorithm for computation of Lyapunov exponents
Supplementary figures

Figures (10)

Figure 1: Schematic representation of the WCRNN model for the sMNIST classification task. Each 28x28 pixel MNIST digit is serialized and presented to the network as a time series of length $D=784$. $s_t$ and $x_t$ denote the stimulus and network amplitude configurations, respectively, at the discrete time step $t$ ($1\leq t \leq 784$). Orange circles indicate RNN nodes (unrolled over time) and green circles indicate the output units for the 10 digit classes. Line colors indicate the input type. The total recurrent input is shown in orange and consists of the residual input mediated by the residual map $R$ (blue), the recurrent input mediated by the recurrent weight matrix $W_{xx}$ (red), and the external input mediated by an input projection matrix $W_{sx}$ (gray). The readout weights $W_{xo}$ of a linear readout performed at $t=784$ are shown in green. In the case of the ADD datasets, the configuration is analogous, except for adjustments in the input and output layers (2d input, one output unit).
Figure 1: Dynamics of eigenvalues of variational term $\mathbf{V}_{x}(f^t)$ before training and after 200 training epochs. Lines show trajectories of 20 randomly chosen eigenvalues over time for a randomly chosen input digit. Colors indicate network type: strongly supercritical networks have $r=1.01$, weakly supercritical have $r=1.0025$, critical have $r=1$, weakly subcritical have $r=0.995$, and strongly subcritical have $r=0.95$. A. ADD100 before training; B. ADD100 after training; C. ADD400 before training; D. ADD400 after training; E. psMNIST before training; F. psMNIST after training.
Figure 2: WCRNN performance and dynamics on sMNIST. Colors indicate network type; strongly subcritical ($r=0.95$), weakly subcritical ($r=0.995$), critical ($r=1$), weakly supercritical ($r=1.0025$), strongly supercritical ($r=1.01$). A. Test accuracy on sMNIST as a function of training iterations over 200 training epochs. B. Dynamics of eigenvalues of variational term $\mathbf{V}_{x}(f^t)$ before training. Lines show trajectories of 20 randomly chosen eigenvalues over time for a randomly chosen input digit. C. Rank plot of the eigenvalue magnitudes of the Hessian of the loss function $\mathbf{H}_{w}(L)$ before training. Lines show eigenvalues that were computed for a randomly chosen batch of the sMNIST test set. D. Evolution of norms of BPTT gradients as a function of time. Lines show gradient norms that were computed over a random input batch before training.
Figure 2: Learning trajectories for WCRNNs subject to different learning rates and coupling constants $\gamma$, trained on the ADD100 dataset. Lines show test accuracy measured in RMS as a function of training iterations over 150 training epochs. A. Learning rate: $\eta=0.1$, coupling constant: $\gamma=0.01$. Note that the learning dynamics are unstable. B. Learning rate: $\eta=0.1$, coupling constant: $\gamma=0.001$. Note that the decrease in $\gamma$ results in more stable but slower learning dynamics compared to A. C. Learning rate: $\eta=0.01$, coupling constant: $\gamma=0.01$. Note that the decrease in learning rate results in more stable but also slower learning dynamics. D. Learning rate: $\eta=0.01$, coupling constant: $\gamma=0.001$. Note the very stable and also very slow learning dynamics.
Figure 3: Practical expressivity of WCRNN networks as a function of the value of the residual connection strength $r$ for the ADD and MNIST datasets. Lyapunov exponents of presented WCRNNs are equal to $\log{r}$. All networks have a value of $\gamma = 0.01$. Lines show mean values over 5 network instances with random weight initialization; shaded areas show the range between minimal and maximal values. A. Best test accuracy on the ADD task as measured by root mean squared error (RMS) attained over 150 training epochs for the ADD datasets. B. The number of training iterations to reach a defined minimal performance (MP) of 0.05 RMS error (see main text) for ADD datasets. C. Best test accuracy for MNIST dataset over 200 training epochs. D. The number of training iterations to reach a MP of 50% test accuracy for MNIST datasets.
...and 5 more figures
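
The regime names used in the captions above follow from the sign of $\log r$. As a worked illustration (not taken from the paper), the sketch below classifies the five $r$ values quoted in the captions and computes $r^{784}$, the factor by which a purely linear residual $x_{t+1} = r\,x_t$ would scale a perturbation over the sMNIST sequence length. Ignoring the coupled term ($\gamma = 0$) is a simplifying assumption, and the function names `regime` and `amplification` are hypothetical.

```python
import math

# Residual strengths quoted in the figure captions.
R_VALUES = (0.95, 0.995, 1.0, 1.0025, 1.01)

def regime(r):
    """Classify by the sign of the Lyapunov exponent log(r)."""
    if r < 1.0:
        return "subcritical"
    if r > 1.0:
        return "supercritical"
    return "critical"

def amplification(r, steps):
    """Scaling of a perturbation after `steps` updates of x_{t+1} = r * x_t
    (assumes a purely linear residual, i.e. gamma = 0)."""
    return r ** steps

if __name__ == "__main__":
    for r in R_VALUES:
        print(f"r={r}: LE=log r={math.log(r):+.5f}, {regime(r)}, "
              f"r^784={amplification(r, 784):.3e}")
```

Even small deviations of $r$ from 1 compound over 784 steps: under this simplification the strongly subcritical setting shrinks a perturbation to a negligible size, while the strongly supercritical one amplifies it by several orders of magnitude.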
