Towards Compressive and Scalable Recurrent Memory

Yunchong Song, Jushi Kai, Liming Lu, Kaixi Qiu, Zhouhan Lin

TL;DR

This work tackles the quadratic attention bottleneck in long-context language modeling by introducing Elastic Memory, a memory architecture grounded in the HiPPO framework for online function approximation. It encodes the history into a fixed-size state via optimal HiPPO compression and retrieves a history summary using a polynomial-sampling reconstruction bank integrated into trapezoidal attention, enabling 32k+ context handling with comparable or fewer parameters. Across three long-document benchmarks, Elastic Memory achieves state-of-the-art perplexity and long-context metrics, often outperforming stronger baselines with significantly less memory and competitive training speed, even as model and memory scale. A decoupled test-time retriever and robustness to local-context corruption demonstrate practical flexibility and reliability, making the approach well-suited for scalable, real-world long-context applications.

Abstract

Transformers face a quadratic bottleneck in attention when scaling to long contexts. Recent approaches introduce recurrent memory to extend context beyond the current window, yet these often face a fundamental trade-off between theoretical principles and practical scalability. To address this, we introduce Elastic Memory, a novel memory architecture grounded in the HiPPO framework for online function approximation. Elastic Memory treats the historical sequence as samples from a continuous signal and applies optimal online compression to encode it into a fixed-size memory state. For retrieval, we propose a flexible polynomial sampling mechanism that reconstructs a history summary from this compressed state. Elastic Memory consistently outperforms baselines on long-context (32k+) datasets across three domains. With equal parameters, it beats Memorizing Transformer even when the latter uses 16x the memory, and it outperforms Melodi at all memory sizes, even when Melodi has 30% more parameters. When scaling model size, Elastic Memory stays ahead of all baselines and is significantly faster than Melodi at 4x size. Furthermore, its decoupled design allows inductive biases to be injected at test time to boost performance.
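To make the compress-then-sample idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the standard HiPPO-LegS operator (scaled Legendre basis over the whole history) with a backward-Euler discretization; the actual discretization, the reconstruction bank, and the integration with attention are not specified in this abstract and may differ. The names `legs_matrices`, `compress_step`, and `polynomial_sample` are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import legendre


def legs_matrices(order):
    """Standard HiPPO-LegS matrices: A is (order, order), B is (order,)."""
    n = np.arange(order)
    scale = np.sqrt(2 * n + 1)
    # A[n, k] = sqrt(2n+1) * sqrt(2k+1) for n > k, and n + 1 on the diagonal.
    A = np.tril(np.outer(scale, scale), k=-1) + np.diag(n + 1.0)
    return A, scale.copy()


def compress_step(state, x, t, A, B):
    """One backward-Euler step of dc/dt = (-A c + B x) / t at integer time t >= 1.

    state has shape (order, dim); x is the new token vector of shape (dim,).
    The state size does not depend on how many tokens have been seen.
    """
    lhs = np.eye(len(B)) + A / t
    return np.linalg.solve(lhs, state + np.outer(B, x) / t)


def polynomial_sample(state, positions):
    """Reconstruct the history at relative positions in [0, 1] (0 = oldest, 1 = newest)."""
    positions = np.asarray(positions, dtype=float)
    order = state.shape[0]
    # Legendre polynomials P_0..P_{order-1} evaluated on [-1, 1]: (num_positions, order).
    basis = legendre.legvander(2.0 * positions - 1.0, order - 1)
    basis = basis * np.sqrt(2 * np.arange(order) + 1)  # match the scaled basis used by A and B
    return basis @ state  # (num_positions, dim)


# Toy history (an assumption for illustration): a 2-dimensional linear ramp of 64 steps.
order, dim, steps = 8, 2, 64
A, B = legs_matrices(order)
state = np.zeros((order, dim))
for t in range(1, steps + 1):
    x_t = np.array([t, -2.0 * t])
    state = compress_step(state, x_t, t, A, B)

# Sample the compressed history at its start, middle, and end.
summary = polynomial_sample(state, [0.0, 0.5, 1.0])
print(summary)
```

A linear ramp lies entirely within the first two Legendre modes, so in this toy case the sampled summary recovers the ramp values at the requested positions up to floating-point error. Richer signals are only approximated, with accuracy governed by `order`. Because sampling positions are passed in at read time and do not affect the stored state, the retrieval side can be changed without recomputing the memory, which is the kind of decoupling the abstract alludes to.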

Paper Structure (69 sections, 32 equations, 13 tables, 1 algorithm)