AllMem: A Memory-centric Recipe for Efficient Long-context Modeling

Ziming Wang; Xiang Wang; Kailong Peng; Lang Qin; Juan Gabriel Kostelec; Christos Sourmpis; Axel Laborieux; Qinghai Guo

TL;DR

AllMem tackles the long-context bottleneck in decoder-only Transformers by fusing sliding-window attention with a nonlinear, test-time trainable memory, achieving $O(L)$ compute and $O(1)$ memory for ultra-long sequences. The method preserves short-context performance through distillation while introducing a memory-augmented, multi-scale representation that mitigates forgetting via a per-channel fusion gate and online test-time memory updates. Experimental results show near-lossless performance on 37k LongBench and superior results on 128k-context InfiniteBench with an 8k window, often outperforming full attention while reducing FLOPs and cache by up to approximately $9\times$. This work provides a framework for converting pretrained LLMs into memory-efficient, long-context capable models and opens avenues for integrating dedicated external memory systems into multi-level memory hierarchies.

Abstract

Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. AllMem enables models to effectively scale to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representation constraints typical of linear memory models but also significantly reduces the computational and memory footprint during long-sequence inference. Furthermore, we implement a Memory-Efficient Fine-Tuning strategy to replace standard attention layers in pre-trained models with memory-augmented sliding window layers. This framework facilitates the efficient transformation of any off-the-shelf pre-trained LLM into an AllMem-based architecture. Empirical evaluations confirm that our 4k window model achieves near-lossless performance on 37k LongBench, with a marginal 0.83 drop compared to full attention. On InfiniteBench at a 128k context, our 8k window variant outperforms full attention, which validates the effectiveness of our parameterized memory in mitigating noise and maintaining robust long-range modeling without the prohibitive costs of global attention.
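To make the memory-footprint argument concrete, the following is a minimal sketch of how key/value cache size scales with sequence length under full attention versus a sliding window. The function name `kv_cache_bytes`, the model shape in `cfg`, and the 2-byte element size are assumptions chosen for illustration, not values from the paper. The sketch also leaves out the fixed-size state of the memory module, which would add a constant that does not grow with sequence length.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, window=None):
    """Bytes of key/value cache held for one sequence.

    window=None models full attention (every token is cached);
    an integer window models a sliding-window cache capped at that many tokens.
    """
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_elem


# Hypothetical model shape, used only to show the scaling trend.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4096, 32768, 131072):
    full = kv_cache_bytes(seq_len, **cfg)
    swa = kv_cache_bytes(seq_len, window=4096, **cfg)
    print(f"L={seq_len:>6}  full={full / 2**20:.0f} MiB  window-4k={swa / 2**20:.0f} MiB")
```

Under these assumptions the full-attention cache grows linearly with the sequence length, while the windowed cache stops growing once the sequence exceeds the window.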

Paper Structure (15 sections, 13 equations, 3 figures, 3 tables)

Introduction
Preliminaries
Method
Design Principles
AllMem
Distillation pipeline
Data Preprocessing
Distillation Method
On-Policy Distillation
Experimental details
Results
Performance on Short-Sequence Benchmarks
Performance on Long-Sequence Benchmarks
Computational Cost and Memory Overhead
Conclusion

Figures (3)

Figure 1: Model architecture of the memory module: (Left) A single decoder layer structure, where the Token Mixer consists of two parallel branches: a sliding-window attention (SWA) module for modeling local, fine-grained dependencies, and a long-term AllMem memory unit for capturing global, persistent semantic patterns. (Right) The internal meta-parameter structure of the TTT-enabled memory unit, including learnable momentum decay rates, learning rates, output gating branches, and QKV projection weights.
Figure 2: Online learning pipeline of the memory module: To prevent future information leakage, we employ a strictly ordered sequence of operations (memory read, weight normalization, and memory update), ensuring stable and robust TTT dynamics.
Figure 3: Comparison of FLOPs and cache size across different model sizes for AllMem memory, full attention, sink-based sliding window attention, and Channel Mixer, as sequence length increases. The sliding window size is 4k, and the update chunk size is 2k.
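The read, normalize, update ordering described in the Figure 2 caption can be illustrated with a toy example. This is only a sketch under stated assumptions: it swaps in a linear memory matrix `W` in place of the non-linear memory network, uses a plain gradient step on a squared reconstruction loss, and treats weight normalization as a cap on the Frobenius norm. The function `run_memory` and all of its parameters are hypothetical. The point is only that outputs for a chunk are read before the memory is updated with that chunk, so they cannot depend on that chunk's keys and values or on any later tokens.

```python
import numpy as np


def run_memory(q, k, v, chunk=4, lr=0.1, max_norm=10.0):
    """Process a sequence chunk by chunk with a toy linear memory W.

    q, k, v: arrays of shape (seq_len, dim).
    Returns the memory read-outs, shape (seq_len, dim).
    """
    seq_len, dim = q.shape
    W = np.zeros((dim, dim))
    outputs = []
    for start in range(0, seq_len, chunk):
        sl = slice(start, start + chunk)
        # 1) Read: W has only been trained on earlier chunks.
        outputs.append(q[sl] @ W.T)
        # 2) Normalize: cap the Frobenius norm (assumed form of normalization).
        norm = np.linalg.norm(W)
        if norm > max_norm:
            W = W * (max_norm / norm)
        # 3) Update: one gradient step on 0.5 * ||k W^T - v||^2, averaged over the chunk.
        err = k[sl] @ W.T - v[sl]
        grad = err.T @ k[sl] / len(k[sl])
        W = W - lr * grad
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(12, 3)) for _ in range(3))
out = run_memory(q, k, v)

# Perturbing the keys and values of the last chunk leaves every output unchanged,
# because each chunk is read before the memory is updated with it.
k2, v2 = k.copy(), v.copy()
k2[8:] += 1.0
v2[8:] -= 1.0
print(np.allclose(out, run_memory(q, k2, v2)))
```

Reversing steps 1 and 3 would let a chunk's own keys and values influence its outputs, which is the leakage the strict ordering is meant to rule out.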
