Gated Recurrent Neural Networks with Weighted Time-Delay Feedback

N. Benjamin Erichson, Soon Hoe Lim, Michael W. Mahoney

TL;DR

τ-GRU addresses the challenge of modeling long-term dependencies in sequential data by deriving a gated recurrent unit from a continuous-time delay differential equation with weighted time-delay feedback. The resulting architecture discretizes to a GRU-like update that includes a delayed term weighted by a gate and a per-component weight, providing gradient-buffering effects to mitigate vanishing gradients. The authors prove the continuous-time model has a unique solution and demonstrate, through extensive experiments on diverse tasks (Adding, HAR-2, IMDB, sequential image classification, climate dynamics, and frequency classification), that τ-GRU converges faster and often generalizes better than state-of-the-art RNNs and some SSMs in the small-data regime. Limitations include the single-delay assumption; future work could explore multiple/distributed delays and noise-injected variants.

Abstract

In this paper, we present a novel approach to modeling long-term dependencies in sequential data by introducing a gated recurrent unit (GRU) with a weighted time-delay feedback mechanism. Our proposed model, named $\tau$-GRU, is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs). We prove the existence and uniqueness of solutions for the continuous-time model and show that the proposed feedback mechanism can significantly improve the modeling of long-term dependencies. Our empirical results indicate that $\tau$-GRU outperforms state-of-the-art recurrent units and gated recurrent architectures on a range of tasks, achieving faster convergence and better generalization.
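
The update described above (a GRU-like step plus a delayed term scaled by a gate and a per-component weight) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the form $h_{n+1} = (1 - g_n) \odot h_n + g_n \odot (u_n + a_n \odot z_n)$, where $z_n$ depends on the hidden state `delay` steps in the past. The parameter names `W1`..`W4` and `U1`..`U4`, the omission of biases, the zero history before the first step, and the helper names `tau_gru_step` and `run_tau_gru` are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tau_gru_step(h, h_delayed, x, p):
    """One GRU-like update with weighted time-delay feedback (illustrative)."""
    u = np.tanh(p["W1"] @ h + p["U1"] @ x)          # instantaneous candidate
    z = np.tanh(p["W2"] @ h_delayed + p["U2"] @ x)  # delayed feedback term
    g = sigmoid(p["W3"] @ h + p["U3"] @ x)          # update gate
    a = sigmoid(p["W4"] @ h + p["U4"] @ x)          # per-component weight on z
    return (1.0 - g) * h + g * (u + a * z)


def run_tau_gru(xs, p, delay):
    """Unroll over xs of shape (T, input_dim); the history before step 0 is zero."""
    hidden_dim = p["W1"].shape[0]
    history = [np.zeros(hidden_dim)]  # history[k] holds h_k, with h_0 = 0
    for n, x in enumerate(xs):
        h = history[-1]
        # Use h_{n - delay} when it exists, otherwise the assumed zero history.
        h_delayed = history[n - delay] if n - delay >= 0 else np.zeros(hidden_dim)
        history.append(tau_gru_step(h, h_delayed, x, p))
    return np.stack(history[1:])  # shape (T, hidden_dim)


# Example usage with random (untrained) parameters.
rng = np.random.default_rng(0)
hidden_dim, input_dim, T = 4, 3, 10
params = {f"W{i}": rng.normal(size=(hidden_dim, hidden_dim)) for i in range(1, 5)}
params.update({f"U{i}": rng.normal(size=(hidden_dim, input_dim)) for i in range(1, 5)})
inputs = rng.normal(size=(T, input_dim))
states = run_tau_gru(inputs, params, delay=3)
```

Setting `delay=0` feeds the current state into both branches, so the delayed term adds no extra memory. The sketch keeps the full state history for clarity, although a buffer of the last `delay + 1` states would be enough.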

Paper Structure (31 sections, 8 theorems, 72 equations, 10 figures, 12 tables)

Key Result

Theorem 1

Let $t_0 \in \mathbb{R}$ and $\phi \in C$ be given. There exists a unique solution $h(t) = h(t, \phi)$ of the delay differential equation governing the continuous-time $\tau$-GRU, defined on $[t_0 - \tau, t_0 + A]$ for any $A > 0$. In particular, the solution exists for all $t \geq t_0$, and [...] for all $t \geq t_0$, where $K = 1 + \|W_1\| + \|W_2\| + \|W_4\|/4$.

Figures (10)

  • Figure 1: Test accuracy for nCIFAR (Chang et al., 2018) versus Google-12 (Warden, 2018). nCIFAR requires a recurrent unit with long-term dependency capabilities, while Google-12 requires a highly expressive unit. Our $\tau$-GRU is able to improve performance on both tasks, relative to existing state-of-the-art alternatives, including LEM (Rusch et al., 2022).
  • Figure 2: Results for the adding task. We show one-standard-deviation bands for LEM and our $\tau$-GRU. On average, $\tau$-GRU converges faster and obtains a lower MSE.
  • Figure 3: Test accuracy for psMNIST.
  • Figure 4: Sensitivity analysis of $\tau$-GRU on psMNIST. The green envelope represents $\pm 1$ s.d. around the mean.
  • Figure 5: Hidden state dynamics of the DDE-based RNNs with $\tau=0.5$ and $\tau=1$, and the ODE-based RNN ($\tau=0$). All RNNs are driven by the same cosine input signal.
  • ...and 5 more figures

Theorems & Definitions (13)

  • Theorem 1: Existence and uniqueness of solution for continuous-time $\tau$-GRU
  • Proposition 1
  • Theorem 2: Adapted from Theorem 3.7 in Smith (2011)
  • Theorem 3: Existence and uniqueness of solution for continuous-time $\tau$-GRU
  • proof
  • Proposition 2
  • proof
  • Lemma 1
  • proof
  • Proposition 3
  • ...and 3 more