An Uncertainty Principle for Linear Recurrent Neural Networks

Alexandre François, Antonio Orvieto, Francis Bach

TL;DR

This paper characterizes linear recurrent neural networks' ability for stable and effective long-range modeling on a simple but core copy task by providing lower bounds of approximation, as well as explicit filters that achieve this lower bound up to constants.

Abstract

We consider linear recurrent neural networks, which have become a key building block of sequence modeling due to their ability for stable and effective long-range modeling. In this paper, we aim at characterizing this ability on a simple but core copy task, whose goal is to build a linear filter of order $S$ that approximates the filter that looks $K$ time steps in the past (which we refer to as the shift-$K$ filter), where $K$ is larger than $S$. Using classical signal models and quadratic cost, we fully characterize the problem by providing lower bounds of approximation, as well as explicit filters that achieve this lower bound up to constants. The optimal performance highlights an uncertainty principle: the optimal filter has to average values around the $K$-th time step in the past with a range (width) that is proportional to $K/S$.

Paper Structure

This paper contains 43 sections, 21 theorems, 132 equations, 7 figures, 3 tables.

Key Result

Theorem 1

Let $S$ be odd and $T = \frac{S-1}{2}$. The filter defined by Eq. eq:cab with $a_s = \exp\left(-\frac{\alpha}{K}\right)\exp\left(i\frac{\pi s}{K}\right)$ and $b_s \propto (-1)^s$ for $s \in \llbracket -T, T\rrbracket$ ...

[Footnote: While in our introduction, for clarity, we considered $s\in\llbracket 1,S\rrbracket$ ...]
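As a minimal numerical sketch of the filter in Theorem 1, the snippet below assumes that Eq. eq:cab yields the impulse response $c_k = \sum_s b_s a_s^k$ (consistent with the terms $\frac{b_s}{1 - a_s e^{-i\omega}}$ described in Figure 3). The function name, the value $\alpha = 1$, and the normalization $b_s = (-1)^s / S$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def shift_filter_impulse_response(S, K, alpha=1.0, n_steps=None):
    """Impulse response c_k = sum_s b_s * a_s**k for k = 0..n_steps-1.

    Assumed parameters (Theorem 1, up to normalization):
      a_s = exp(-alpha / K) * exp(i * pi * s / K),  b_s = (-1)**s / S,
    for s in [-T, T] with T = (S - 1) / 2.
    """
    if S % 2 == 0:
        raise ValueError("S must be odd")
    T = (S - 1) // 2
    if n_steps is None:
        n_steps = 2 * K  # long enough to contain the peak near k = K

    s = np.arange(-T, T + 1)
    log_a = -alpha / K + 1j * np.pi * s / K      # log of each pole a_s
    b = np.where(s % 2 == 0, 1.0, -1.0) / S      # assumed normalization 1/S

    k = np.arange(n_steps)
    powers = np.exp(np.outer(k, log_a))          # a_s**k, shape (n_steps, S)
    return powers @ b                            # complex; imaginary part ~ 0


# Same S and K as in the caption of Figure 2; alpha = 1 is an arbitrary choice.
S, K, alpha = 51, 500, 1.0
c = shift_filter_impulse_response(S, K, alpha)
peak = int(np.argmax(np.abs(c)))
print("peak location:", peak, "peak value:", c[peak].real)
```

Under these assumptions the sum over $s$ is a Dirichlet kernel, so the response peaks at $k = K$ and its first zeros sit near $K \pm 2K/S$, in line with the width proportional to $K/S$ stated in the abstract.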

Figures (7)

  • Figure 1: Learning to shift-$K$ with linear recurrences exhibits an uncertainty principle. For fixed $S=250$, different values of $K$ induce different performances: the smaller the ratio $S/K$, the lower the peak of the filter and the larger the width. For a fixed memory size $S$, increasing the time horizon is feasible, but comes at the expense of resolution. For $K>S$, the width of the filter around the correct location is proportional to $K/S$.
  • Figure 2: Shown is the behavior of $\frac{C(e^{i\omega})}{D(e^{i\omega})}$, where $C(e^{i\omega})$ is the Fourier transform of our near-optimal filter in Theorem \ref{thm:informal_upper} and $D(e^{i\omega}) = e^{-iK\omega}$ is the Fourier transform of our shift-$K$ filter. Perfect match between filters implies the ratio is $1$ for all $\omega$. If instead this equality holds in a window, then the filter would effectively act as a shift-$K$ for inputs with frequencies $\Gamma(e^{-i\omega})$ in the same window. For $S=51, K=500$, we denote $T = \frac{S-1}{2}$ and plot the ratio $\frac{C(e^{i\omega})}{D(e^{i\omega})}$ with respect to $\Omega=\frac{K\omega}{\pi}$ (to dilate the space). The asymptotic ratio $\frac{C(e^{i\omega})}{D(e^{i\omega})}$ (yellow) from Theorem \ref{convergence to window}, the same ratio for linear models with $(b_s)$ given by Eq. \ref{Param_of_the_bs} (green), and for $(b_s)$ given by linear system inversion Eq. \ref{bs linear system inversion} (blue) are compared. The model effectively approximates the shift-$K$ operation within the frequency window $[-\frac{\pi T}{K}, \frac{\pi T}{K}]$, while vanishing outside this window, leading to a time resolution (inverse of filter width) of $\frac{S}{K}$. This behavior underscores the uncertainty principle associated with the filter: for small $S/K$ ratios and uncorrelated data, the approximation holds over a narrow frequency range. As autocorrelation increases, the approximation domain shrinks, enhancing accuracy. In red, we show the perfect window (value of $1$ on $[-\frac{\pi T}{K}, \frac{\pi T}{K}]$ and $0$ outside).
  • Figure 3: Poor performance of the filter for white noise data is due to its approximation of the complex exponential over a limited frequency window of size $\frac{\pi S}{K}$. Left: The target filter $\exp(-iK\omega)$ (blue) for $K=450$, and the approximated filter using linear recurrences (green) for $S=90$. The approximation is reasonably accurate within the frequency window of size $\frac{\pi S}{K}$, indicated by the dashed yellow lines. Outside this window, the filter is zero, demonstrating the inability of filters based on linear recurrences to perfectly memorize long-range data with broad spectra. Right: Contributions from all individual terms $\frac{b_s}{1 - a_s e^{-i\omega}}$ for $s\in\llbracket -45, 45 \rrbracket$. Each individual term captures one oscillation of the complex exponential, making their contributions highly localized. This design reflects the structure of the filter's parameters.
  • Figure 4: Initialization with regularly spaced phases enhances robustness and outperforms random initialization near the unit disk. Left: For $N=1500$ and $t^* = 200$, initialization using our filter defined in Eq. \ref{Param_of_the_as} and Eq. \ref{Param_of_the_bs}. Right: For $N=2250$ and $t^*=250$, the task consists of learning a shift-$K$ filter with $K^*=2000$. Here, $\rho = 0.7$.
  • Figure B.1: The autocorrelation factor $\rho$ determines the width of the spectral power density $\Gamma(e^{i\omega})$. The larger $\rho$, the narrower the spectral power density. This means that increasing $\rho$ in $\mathcal{L}_\text{freq}(c, d)$ narrows the bandwidth over which we evaluate the difference $\vert C(e^{i\omega}) - D(e^{i\omega})\vert^2$, leading to improved performance.
  • ...and 2 more figures
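The frequency-window behavior described in the captions of Figures 2 and 3 can be probed with a short sketch. It sums the individual terms $\frac{b_s}{1 - a_s e^{-i\omega}}$ named in Figure 3, using the $a_s$ from Theorem 1. The function name, the choice $\alpha = 1$, and the unnormalized $b_s = (-1)^s$ are assumptions made for illustration, so only the relative magnitude inside versus outside the window is meaningful here, not the ratio $\frac{C(e^{i\omega})}{D(e^{i\omega})}$ itself.

```python
import numpy as np


def frequency_response(omega, S, K, alpha=1.0):
    """C(e^{i omega}) = sum_s b_s / (1 - a_s * exp(-i omega)).

    Assumed parameters: a_s = exp(-alpha/K + i*pi*s/K) and unnormalized
    b_s = (-1)**s for s in [-T, T], with T = (S - 1) / 2.
    """
    T = (S - 1) // 2
    s = np.arange(-T, T + 1)
    a = np.exp(-alpha / K + 1j * np.pi * s / K)
    b = np.where(s % 2 == 0, 1.0, -1.0)
    omega = np.atleast_1d(omega)
    # One row per frequency, one column per pole a_s.
    terms = b[None, :] / (1.0 - a[None, :] * np.exp(-1j * omega[:, None]))
    return terms.sum(axis=1)


S, K = 51, 500                      # values used in the caption of Figure 2
T = (S - 1) // 2
omega = np.linspace(-np.pi, np.pi, 20001)
mag = np.abs(frequency_response(omega, S, K))

edge = np.pi * T / K                # window [-pi*T/K, pi*T/K] from Figure 2
inside = mag[np.abs(omega) <= edge].mean()
outside = mag[np.abs(omega) >= 2 * edge].mean()
print("mean |C| inside window / outside window:", inside / outside)
```

A large ratio here reflects the same qualitative picture as the captions: the response is concentrated on a band of width of order $\frac{\pi S}{K}$ and is comparatively negligible elsewhere.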

Theorems & Definitions (23)

  • Theorem 1: upper bound, informal
  • Theorem 2: Lower bound of the approximation error (white noise)
  • Theorem 3: Lower bound of the approximation error (auto-correlated noise)
  • Lemma 1
  • Lemma 2
  • Theorem 4: Upper bound of the error
  • Theorem 5
  • Proposition 1: Linear RNNs and convolution form
  • Proposition 2: Convolution
  • Definition 1
  • ...and 13 more