Inverse Approximation Theory for Nonlinear Recurrent Neural Networks

Shida Wang; Zhong Li; Qianxiao Li

TL;DR

The paper addresses the problem of understanding when nonlinear RNNs can efficiently approximate nonlinear sequence-to-sequence relationships. It introduces a memory-based inverse (Bernstein-type) framework for nonlinear functionals, defines a memory function for nonlinear targets, and establishes a stable-approximation notion to ground optimization in practice. The main contribution is a Bernstein-type theorem showing that, under stable approximation, nonlinear RNN targets must exhibit exponential memory decay, extending previous linear results to nonlinear activations. The work also proposes stable reparameterization as a principled method to overcome the long-memory limitations and validates the theory with numerical experiments and public code, highlighting both fundamental limits and practical remedies for learning long-range dependencies.

Abstract

We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs). This is a so-called Bernstein-type result in approximation theory, which deduces properties of a target function under the assumption that it can be effectively approximated by a hypothesis space. In particular, we show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponentially decaying memory structure, a notion that can be made precise. This extends the previously identified curse of memory in linear RNNs to the general nonlinear setting, and quantifies the essential limitations of the RNN architecture for learning sequential relationships with long-term memory. Based on the analysis, we propose a principled reparameterization method to overcome the limitations. Our theoretical results are confirmed by numerical experiments. The code is available at https://github.com/radarFudan/Curse-of-memory
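
To make the reparameterization idea concrete, here is a minimal sketch and not the authors' implementation. It assumes a single hidden unit with continuous-time dynamics dh/dt = lam * h, where stability requires lam < 0, and compares a direct parameterization with the exp and softplus maps named in the Figure 4 caption. The function names, the decay rate `rate`, and the perturbation size `beta` are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical helpers: each maps a raw trainable weight w to the eigenvalue
# lam of the assumed hidden dynamics dh/dt = lam * h (stable iff lam < 0).
def direct(w):
    return -w                      # stable only while w > 0

def exp_reparam(w):
    return -np.exp(w)              # negative for every real w

def softplus_reparam(w):
    return -np.logaddexp(0.0, w)   # -log(1 + exp(w)), negative for every real w

rate = 0.01   # slow decay rate, i.e. long memory (illustrative value)
beta = 0.05   # size of the weight perturbation (illustrative value)

# Raw weights that all realise lam = -rate before any perturbation.
w_direct = rate
w_exp = np.log(rate)
w_softplus = np.log(np.expm1(rate))

# Perturb the raw weights by at most beta and look at the resulting eigenvalue.
shifts = np.array([-beta, beta])
lam_direct = direct(w_direct - beta)           # can cross zero
lam_exp = exp_reparam(w_exp + shifts)          # stays negative
lam_softplus = softplus_reparam(w_softplus + shifts)

print("direct:", lam_direct)
print("exp:", lam_exp)
print("softplus:", lam_softplus)
```

In this toy setting the directly parameterized eigenvalue changes sign under a perturbation larger than the decay rate, while the two reparameterized eigenvalues stay negative for any perturbation of the raw weight. This is only an intuition for why such maps can help with slowly decaying targets; the paper's precise conditions are not reproduced here.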

Paper Structure (45 sections, 10 theorems, 109 equations, 12 figures, 1 table)

This paper contains 45 sections, 10 theorems, 109 equations, 12 figures, 1 table.

Introduction
Notation.
Problem formulation and prior results on linear RNNs
The approximation problem for sequence modeling
Forward and inverse approximation theorems.
The RNN architecture and prior results
Main results
Memory function for nonlinear functionals
Stable approximation
Bernstein-type approximation result for nonlinear RNNs
Interpretation of Theorem \ref{thm:main_result_hardtanh}.
Suitable parametrization enables stable approximation
Related work
Conclusion
Theoretical results and proofs
...and 30 more sections

Key Result

Theorem 3.9

Assume $\mathbf{H}$ is a sequence of bounded continuous, causal, regular and time-homogeneous functionals on $\mathcal{X}$ with decaying memory. Let the activation be in $\mathcal{A}_0 \cup \mathcal{A}_1$. Suppose $\mathbf{H}$ is $\beta_0$-stably approximated by a sequence of RNNs $\{\widehat{\mathbf{H}}\ldots\}$ [...] Then the memory function $\mathcal{M}(\mathbf{H})(t)$ of the target decays exponentially:

Figures (12)

Figure 1: Perturbation errors for linear functionals with different decaying memory. The anticipated limiting curve $E(\beta)$ is marked with a black dashed line. (a) For linear functional sequences with exponentially decaying memory, there exists a perturbation radius $\beta_0$ such that the perturbation error $E(\beta)$ is continuous for $0 \leq \beta < \beta_0$. (b) Approximation of linear functional sequences with polynomially decaying memory. As the hidden dimension $m$ increases, the perturbation radius within which the error remains small decreases, suggesting that there may not exist a $\beta_0$ achieving the stable approximation condition. The intersections of the lines shift left as the hidden dimension $m$ increases. The anticipated limiting curve $E(\beta)$ is not continuous for the target with polynomially decaying memory.
Figure 2: Target with polynomially decaying memory + approximation (achieved at 1000 epochs) $\to$ no stability. Similar to the linear functional case, when nonlinear functionals with polynomially decaying memory are approximated by a tanh RNN, the intersections of the curves shift left as the hidden dimension $m$ increases.
Figure 3: Stable approximation via RNNs implies exponentially decaying memory. We construct several randomly initialized RNN models with large hidden dimension ($m=256$) as teacher models. When approximating a teacher model with a series of student RNN models, we can numerically verify the approximation's stability (left panel). We then filter the teachers, keeping only those that can be approximated and whose approximations are stable (the perturbation error $E_m(\beta)$ has a positive stability radius). The only teachers that remain are those with exponentially decaying memory functions. An example is shown in the right panel.
Figure 4: Stable approximation of linear functionals with polynomially decaying memory by a linear RNN with exp and softplus reparameterization. The limiting dashed curve $E(\beta)$ should be continuous.
Figure 5: Memory function of sentiment scores for different words based on IMDB movie reviews using Bidirectional LSTM and stacked Bidirectional LSTM
...and 7 more figures
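
The perturbation error $E(\beta)$ plotted in Figures 1 and 2 can be estimated numerically. The sketch below is an illustration under stated assumptions, not the paper's experimental code: it assumes a diagonal linear RNN whose kernel is $\rho(t) = \sum_i a_i e^{-\lambda_i t}$, measures the error as an $L^1$ distance between kernels on a finite time grid (a stand-in for the norm used in the paper), and approximates the supremum over the ball of radius $\beta$ by random sampling, which only gives a lower estimate. All function names and numerical values are hypothetical.

```python
import numpy as np

T = np.linspace(0.0, 50.0, 501)   # finite time grid (assumed horizon)
DT = T[1] - T[0]

def kernel(lam, a):
    """Kernel of a diagonal linear RNN: rho(t) = sum_i a_i * exp(-lam_i * t)."""
    return np.exp(-np.outer(T, lam)) @ a

def error(theta, target):
    """Riemann-sum L1 distance between the model kernel and the target kernel."""
    m = theta.size // 2
    lam, a = theta[:m], theta[m:]
    return np.sum(np.abs(kernel(lam, a) - target)) * DT

def perturbation_error(theta, target, beta, n_samples=256, seed=0):
    """Sampled lower estimate of sup over |delta|_2 <= beta of error(theta + delta)."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_samples, theta.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.uniform(0.0, 1.0, size=(n_samples, 1))
    # Include the unperturbed point so the estimate is never below the base error.
    candidates = np.vstack([np.zeros(theta.size), beta * radii * directions])
    return max(error(theta + delta, target) for delta in candidates)

# Toy model with two modes; the target is the model's own kernel, so the
# unperturbed error is zero.
theta = np.array([0.3, 1.0, 1.0, -0.5])   # [lam_1, lam_2, a_1, a_2]
target = kernel(theta[:2], theta[2:])

E0 = perturbation_error(theta, target, beta=0.0)
E_small = perturbation_error(theta, target, beta=0.1)
E_large = perturbation_error(theta, target, beta=0.5)
print(E0, E_small, E_large)
```

With `beta=0.5` some sampled perturbations may push the smallest decay rate below zero, in which case the kernel grows over the time grid and the estimated error increases sharply. Sweeping `beta` and plotting the estimate gives a curve of the kind shown in the figures, up to the choice of norm and sampling scheme.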

Theorems & Definitions (29)

Definition 2.1
Definition 3.1: Memory function of nonlinear functional sequences
Definition 3.2: Decaying memory
Remark 3.3
Definition 3.4
Definition 3.5: Stable approximation via parameterized models
Remark 3.6
Definition 3.7
Definition 3.8
Theorem 3.9
...and 19 more
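
For intuition about the notion of decaying memory in the linear setting, the following sketch checks numerically that the impulse-response kernel of a stable diagonal linear RNN is bounded by a decaying exponential. It assumes the kernel $\rho(t) = \sum_i c_i u_i e^{-\lambda_i t}$ with all $\lambda_i > 0$ and uses $|\rho(t)|$ as a proxy for memory; the paper's Definitions 3.1 and 3.2 for nonlinear functionals are not reproduced here. The function name and parameter ranges are hypothetical.

```python
import numpy as np

def linear_rnn_kernel(lam, c, u, t):
    """Impulse response rho(t) = sum_i c_i * u_i * exp(-lam_i * t)."""
    # modes has shape (len(t), m); each column is one decaying mode.
    modes = np.exp(-np.outer(t, lam))
    return modes @ (c * u)

rng = np.random.default_rng(0)
m = 16                                   # hidden dimension (illustrative)
lam = rng.uniform(0.5, 2.0, size=m)      # decay rates, all positive (stable)
c = rng.normal(size=m)                   # readout weights
u = rng.normal(size=m)                   # input weights
t = np.linspace(0.0, 20.0, 201)

rho = linear_rnn_kernel(lam, c, u, t)

# Triangle inequality: |rho(t)| <= sum_i |c_i u_i| * exp(-min(lam) * t).
envelope = np.sum(np.abs(c * u)) * np.exp(-lam.min() * t)
print(bool(np.all(np.abs(rho) <= envelope + 1e-12)))
```

The envelope decays at the rate of the slowest mode, so a kernel that decays only polynomially cannot be matched exactly by any finite set of such modes with rates bounded away from zero.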
