HiPPO: Recurrent Memory with Optimal Polynomial Projections

Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Ré

TL;DR

HiPPO reframes memory in sequential data as online function approximation, projecting the history onto a polynomial basis with a time-varying measure. The framework derives tractable ODE or recurrence updates for the projection coefficients, unifying the LegT, LagT, and LegS memory schemes and recovering the LMU in a principled way. The novel LegS variant uses a scaled Legendre measure to achieve timescale robustness without hyperparameters, with strong gradient behavior and theoretical error bounds, and it yields state-of-the-art results on permuted MNIST while showing resilience to timescale shifts and missing data. Empirically, HiPPO-integrated RNNs demonstrate superior long-range memory, fast online updates, and scalability to millions of steps, suggesting broad applicability to long-horizon sequence modeling and time-series tasks.

Abstract

A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed. We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem. As special cases, our framework yields a short derivation of the recent Legendre Memory Unit (LMU) from first principles, and generalizes the ubiquitous gating mechanism of recurrent neural networks such as GRUs. This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale. HiPPO-LegS enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients. By incorporating the memory dynamics into recurrent neural networks, HiPPO RNNs can empirically capture complex temporal dependencies. On the benchmark permuted MNIST dataset, HiPPO-LegS sets a new state-of-the-art accuracy of 98.3%. Finally, on a novel trajectory classification task testing robustness to out-of-distribution timescales and missing data, HiPPO-LegS outperforms RNN and neural ODE baselines by 25-40% accuracy.

Paper Structure

This paper contains 142 sections, 12 theorems, 128 equations, 10 figures, and 6 tables.

Key Result

Theorem 1

For LegT and LagT, the $\mathop{\mathrm{hippo}}\nolimits$ operators satisfying Definition 1 are given by linear time-invariant (LTI) ODEs $\frac{d}{d t} c(t) = - A c(t) + B f(t)$, where $A \in \mathbb{R}^{N \times N}, B \in \mathbb{R}^{N \times 1}$:
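As a rough sketch of how an ODE of this form becomes a discrete recurrence $c_{k+1} = \bar{A} c_k + \bar{B} f_k$, the snippet below applies the three discretization rules named in the Figure 4 caption (forward Euler, backward Euler, bilinear). The function names `discretize` and `run`, the step size `dt`, and the 1-D shape used for `B` are assumptions made for illustration; any `A` and `B` passed in are placeholders supplied by the caller, not the LegT or LagT matrices of the theorem.

```python
import numpy as np


def discretize(A, B, dt, method="bilinear"):
    """Return (A_bar, B_bar) so that c[k+1] = A_bar @ c[k] + B_bar * f[k]
    approximates dc/dt = -A c + B f with step size dt.

    A has shape (N, N); B is stored as a 1-D array of length N (assumption).
    """
    I = np.eye(A.shape[0])
    if method == "forward":
        # c[k+1] = c[k] + dt * (-A c[k] + B f[k])
        return I - dt * A, dt * B
    if method == "backward":
        # c[k+1] = c[k] + dt * (-A c[k+1] + B f[k])
        inv = np.linalg.inv(I + dt * A)
        return inv, dt * (inv @ B)
    if method == "bilinear":
        # c[k+1] = c[k] + dt * (-A (c[k] + c[k+1]) / 2 + B f[k])
        inv = np.linalg.inv(I + (dt / 2) * A)
        return inv @ (I - (dt / 2) * A), dt * (inv @ B)
    raise ValueError(f"unknown method: {method}")


def run(A, B, f, dt, method="bilinear"):
    """Run the recurrence over the samples in f, starting from c = 0."""
    A_bar, B_bar = discretize(A, B, dt, method)
    c = np.zeros(A.shape[0])
    for fk in f:
        c = A_bar @ c + B_bar * fk
    return c
```

For example, `run(np.tril(np.ones((4, 4))), np.ones(4), np.ones(100), 0.01)` steps a 4-dimensional placeholder system through 100 samples of a constant input. All three rules share the same fixed point for a constant input; they differ in how accurately they track the transient.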

Figures (10)

  • Figure 1: Illustration of the HiPPO framework. (1) For any function $f$, (2) at every time $t$ there is an optimal projection $g^{(t)}$ of $f$ onto the space of polynomials, with respect to a measure $\mu^{(t)}$ weighing the past. (3) For an appropriately chosen basis, the corresponding coefficients $c(t)\in\mathbb{R}^N$ representing a compression of the history of $f$ satisfy linear dynamics. (4) Discretizing the dynamics yields an efficient closed-form recurrence for online compression of time series $(f_k)_{k\in\mathbb{N}}$.
  • Figure 2: HiPPO incorporated into a simple RNN model. $\mathop{\mathrm{hippo}}\nolimits$ is the HiPPO memory operator which projects the history of the $f_t$ features depending on the chosen measure.
  • Figure 3: Input function and its reconstructions.
  • Figure 4: Absolute error for different discretization methods. Forward and backward Euler are generally not very accurate, while bilinear yields a more accurate approximation.
  • Figure 5: Illustration of HiPPO measures. At time $t_0$, the history of a function $f(x)_{x \le t_0}$ is summarized by polynomial approximation with respect to the measure $\mu^{(t_0)}$ (blue), and similarly for time $t_1$ (purple). (Left) The Translated Legendre measure (LegT) assigns weight in the window $[t-\theta, t]$. For small $t$, $\mu^{(t)}$ is supported on a region $x < 0$ where $f$ is not defined. When $t$ is large, the measure is not supported near $0$, causing the projection of $f$ to forget the beginning of the function. (Middle) The Translated Laguerre (LagT) measure decays the past exponentially. It does not forget, but also assigns weight on $x < 0$. (Right) The Scaled Legendre measure (LegS) weights the entire history $[0, t]$ uniformly.
  • ...and 5 more figures
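
The projection step described in the captions of Figures 1 and 5 can be illustrated offline. The sketch below projects a function on $[0, t]$ onto the first $N$ Legendre polynomials under a uniform measure, which is the weighting the Figure 5 caption attributes to LegS, and then reconstructs it. This is a batch computation by Gauss-Legendre quadrature, not the online update that HiPPO derives; the function names, the normalization to an orthonormal basis, and the `quad_points` parameter are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import legendre as L


def project_uniform(f, t, N, quad_points=64):
    """Coefficients of f on [0, t] in an orthonormal Legendre basis,
    with respect to the uniform probability measure on [0, t]."""
    x, w = L.leggauss(quad_points)           # nodes and weights on [-1, 1]
    s = 0.5 * t * (x + 1.0)                  # map nodes to [0, t]
    scale = np.sqrt(2 * np.arange(N) + 1)    # makes the basis orthonormal
    basis = L.legvander(x, N - 1) * scale    # shape (quad_points, N)
    # The factor 0.5 turns the integral over [-1, 1] into a uniform average.
    return 0.5 * (w * f(s)) @ basis


def reconstruct(c, s, t):
    """Evaluate the polynomial with coefficients c at points s in [0, t]."""
    N = len(c)
    x = 2.0 * np.asarray(s, dtype=float) / t - 1.0
    scale = np.sqrt(2 * np.arange(N) + 1)
    return (L.legvander(x, N - 1) * scale) @ c
```

Calling `project_uniform(np.sin, 5.0, 8)` returns eight coefficients that summarize `sin` on $[0, 5]$, and `reconstruct` evaluates the resulting polynomial approximation. Any polynomial of degree below $N$ is reproduced exactly up to floating-point error, while other functions are approximated in the least-squares sense under the uniform measure.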

Theorems & Definitions (16)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Theorem 7
  • Proposition 8
  • Proposition 9
  • ...and 6 more