Recurrent Memory Transformer
Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
TL;DR
This work targets the challenge of long-range dependencies and context length in Transformer models by introducing the Recurrent Memory Transformer (RMT), which augments each input segment with fixed memory tokens and propagates memory across segments through segment-level recurrence. The backbone Transformer remains unchanged; memory tokens serve as dedicated read/write storage enabling more compact, long-term representations and gradient flow via BPTT. Empirical results show RMT matches or surpasses Transformer-XL on long-sequence tasks (copy, reverse, associative retrieval, quadratic equations) with smaller memory footprints and competitive language modeling performance on WikiText-103 and enwik8; combining RMT with Transformer-XL cache yields further gains. The approach also demonstrates compatibility with pretrained models for downstream long-text tasks, highlighting practical potential for memory-augmented Transformers in reasoning and algorithmic domains.
Abstract
Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory allows to store and process local and global information as well as to pass information between segments of the long sequence with the help of recurrence. We implement a memory mechanism with no changes to Transformer model by adding special memory tokens to the input or output sequence. Then the model is trained to control both memory operations and sequence representations processing. Results of experiments show that RMT performs on par with the Transformer-XL on language modeling for smaller memory sizes and outperforms it for tasks that require longer sequence processing. We show that adding memory tokens to Tr-XL is able to improve its performance. This makes Recurrent Memory Transformer a promising architecture for applications that require learning of long-term dependencies and general purpose in memory processing, such as algorithmic tasks and reasoning.
