Approximation Bounds for Recurrent Neural Networks with Application to Regression

Yuling Jiao, Yang Wang, Bokai Yan

TL;DR

This work analyzes the approximation power of deep ReLU RNNs for sequence-to-sequence mappings and provides statistical guarantees for nonparametric regression with dependent data. It proves an equivalence between RNNs and FNNs and establishes the first explicit approximation rate for deep ReLU RNNs approximating past-dependent Hölder functions, with width and depth scaling as $W\asymp J\log J$ and $L\asymp I\log I$, and error bounded by $(JI)^{-2\gamma/(d_x t)}$. It then derives nonasymptotic, minimax-optimal risk bounds for RNN-based regression under both exponentially $\beta$-mixing and i.i.d. data, and characterizes how the rates degrade under algebraic mixing. The results give concrete statistical guarantees for RNNs in sequential settings and point to structural assumptions or novel architectures as ways to improve efficiency.
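To get a feel for the stated scalings, the following minimal sketch evaluates $W\asymp J\log J$, $L\asymp I\log I$, and the bound $(JI)^{-2\gamma/(d_x t)}$ numerically. It assumes that $\gamma$ is the Hölder smoothness index, $d_x$ the per-step input dimension, and $t$ the time step, and it sets every hidden constant to 1, so the numbers only illustrate trends. The function name `rnn_size_and_error` is hypothetical.

```python
import math


def rnn_size_and_error(J, I, gamma, d_x, t):
    """Evaluate the stated scalings with all hidden constants set to 1 (assumption).

    Returns (width, depth, error) where
      width ~ J log J, depth ~ I log I, error ~ (J I)^(-2 gamma / (d_x t)).
    """
    width = J * math.log(J)
    depth = I * math.log(I)
    error = (J * I) ** (-2.0 * gamma / (d_x * t))
    return width, depth, error


# Same network budget (J = I = 16), evaluated at time steps t = 1 and t = 4.
for t in (1, 4):
    width, depth, error = rnn_size_and_error(J=16, I=16, gamma=1.0, d_x=2, t=t)
    print(f"t={t}: width~{width:.1f}, depth~{depth:.1f}, error~{error:.6f}")
```

Because $t$ sits in the denominator of the exponent, the bound for a fixed network size weakens at later time steps: the function being approximated at step $t$ takes $d_x t$ inputs.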

Abstract

We study the approximation capacity of deep ReLU recurrent neural networks (RNNs) and explore the convergence properties of nonparametric least squares regression using RNNs. We derive upper bounds on the approximation error of RNNs for Hölder smooth functions, in the sense that the output at each time step of an RNN can approximate a Hölder function that depends only on past and current information, termed a past-dependent function. This allows a carefully constructed RNN to simultaneously approximate a sequence of past-dependent Hölder functions. We apply these approximation results to derive non-asymptotic upper bounds for the prediction error of the empirical risk minimizer in the regression problem. Our error bounds achieve the minimax optimal rate under both exponentially $\beta$-mixing and i.i.d. data assumptions, improving upon existing ones. Our results provide statistical guarantees on the performance of RNNs.
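As a rough illustration of least squares empirical risk minimization, the sketch below picks the predictor with the smallest mean squared error from a small, finite set of candidates. This is a deliberate simplification: the paper minimizes over a class of RNNs, and its exact loss (for instance, which time steps enter it) is not given in this summary. The names `empirical_risk` and `erm`, the toy data, and the candidate predictors are all hypothetical.

```python
import numpy as np


def empirical_risk(predict, X, Y):
    """Mean squared error of `predict` over n sequences.

    X has shape (n, N, d_x): n sequences of length N.
    Y has shape (n,): one scalar response per sequence (an assumption).
    """
    preds = np.array([predict(x) for x in X])
    return float(np.mean((preds - Y) ** 2))


def erm(candidates, X, Y):
    """Return the index of the candidate with the smallest empirical risk."""
    risks = [empirical_risk(f, X, Y) for f in candidates]
    return int(np.argmin(risks)), risks


# Toy data: the response is the sequence average of the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5, 2))
Y = np.array([x[:, 0].mean() for x in X])

candidates = [
    lambda x: x[-1, 0],        # uses only the last time step
    lambda x: x[:, 0].mean(),  # uses the whole sequence (matches the target)
    lambda x: 0.0,             # ignores the input
]
best, risks = erm(candidates, X, Y)
print("selected candidate:", best)
```

In this noiseless toy setting the second candidate reproduces the responses exactly, so its empirical risk is zero and it is the one selected.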

Paper Structure (22 sections, 14 theorems, 161 equations, 2 figures)


Key Result

Proposition 1

Let $\mathcal{N}_1 \in \mathcal{RNN}_{d_{x,1}, d_{y,1}}(W_1, L_1)$ and $\mathcal{N}_2 \in \mathcal{RNN}_{d_{x,2}, d_{y,2}}(W_2, L_2)$.

Figures (2)

  • Figure 1: An illustration of the network architecture of $Y = \mathcal{N}(X) = \mathcal{Q} \circ \mathcal{R}_3 \circ \mathcal{R}_2 \circ \mathcal{R}_1 \circ \mathcal{P}(X)$, where $\mathcal{P}$ denotes the embedding map, $\mathcal{R}_1$, $\mathcal{R}_2$, and $\mathcal{R}_3$ are the recurrent layers, and $\mathcal{Q}$ represents the projection map. In this network, the depth is $L = 3$ and the length of the input sequence is $N=5$.
  • Figure 2: An illustration of Theorem 3. It holds simultaneously that $\mathcal{N}(X)[1] \approx f^{(1)}(x[1])$, $\mathcal{N}(X)[2] \approx f^{(2)}(x[1], x[2])$, and similarly, $\mathcal{N}(X)[N] \approx f^{(N)}(x[1], \ldots, x[N])$.
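
The composition in Figure 1 and the past-dependency shown in Figure 2 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's construction: each recurrent layer is assumed to take the common form $h[t] = \sigma(A\,h[t-1] + B\,x[t] + c)$ with $h[0] = 0$, the embedding $\mathcal{P}$ and projection $\mathcal{Q}$ are assumed to be linear maps applied at every time step, and the weights are random. The function names and sizes are hypothetical, apart from $L = 3$ and $N = 5$, which follow the caption.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def recurrent_layer(X, A, B, c):
    """Assumed layer form: H[t] = relu(A @ H[t-1] + B @ X[t] + c), zero initial state."""
    N, width = X.shape[0], A.shape[0]
    H = np.zeros((N, width))
    h_prev = np.zeros(width)
    for t in range(N):
        h_prev = relu(A @ h_prev + B @ X[t] + c)
        H[t] = h_prev
    return H


def deep_rnn(X, P, layers, Q):
    """N(X) = Q o R_L o ... o R_1 o P(X), with P and Q applied per time step."""
    H = X @ P.T                      # embedding map P
    for A, B, c in layers:           # recurrent layers R_1, ..., R_L
        H = recurrent_layer(H, A, B, c)
    return H @ Q.T                   # projection map Q


rng = np.random.default_rng(0)
d_x, d_y, width, L, N = 2, 1, 8, 3, 5   # depth L = 3, sequence length N = 5
P = rng.normal(size=(width, d_x))
layers = [
    (0.3 * rng.normal(size=(width, width)),
     0.3 * rng.normal(size=(width, width)),
     rng.normal(size=width))
    for _ in range(L)
]
Q = rng.normal(size=(d_y, width))

X = rng.normal(size=(N, d_x))
Y = deep_rnn(X, P, layers, Q)

# Past-dependency check: change only the inputs at steps 4 and 5.
X_changed = X.copy()
X_changed[3:] += 10.0
Y_changed = deep_rnn(X_changed, P, layers, Q)
print("first three outputs unchanged:", np.allclose(Y[:3], Y_changed[:3]))
```

The outputs at the first three steps are unaffected by the change because each step only reads the current input and the previous hidden state. This is the structural reason an RNN output at step $t$ can only target a function of $x[1], \ldots, x[t]$.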

Theorems & Definitions (34)

  • Proposition 1
  • Proposition 2
  • Definition 1: Hölder classes
  • Definition 2: Past-dependency
  • Theorem 3
  • Definition 3: Covering number
  • Lemma 4
  • Lemma 5
  • Definition 4: Stationarity
  • Definition 5: $\beta$-mixing
  • ...and 24 more