Recurrent Memory Transformer

Aydar Bulatov; Yuri Kuratov; Mikhail S. Burtsev

TL;DR

This work targets the challenge of long-range dependencies and context length in Transformer models by introducing the Recurrent Memory Transformer (RMT), which augments each input segment with fixed memory tokens and propagates memory across segments through segment-level recurrence. The backbone Transformer remains unchanged; memory tokens serve as dedicated read/write storage enabling more compact, long-term representations and gradient flow via BPTT. Empirical results show RMT matches or surpasses Transformer-XL on long-sequence tasks (copy, reverse, associative retrieval, quadratic equations) with smaller memory footprints and competitive language modeling performance on WikiText-103 and enwik8; combining RMT with Transformer-XL cache yields further gains. The approach also demonstrates compatibility with pretrained models for downstream long-text tasks, highlighting practical potential for memory-augmented Transformers in reasoning and algorithmic domains.

Abstract

Transformer-based models have proven effective across multiple domains and tasks. Self-attention combines information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory makes it possible to store and process local and global information, and to pass information between segments of a long sequence through recurrence. We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The model is then trained to control both memory operations and the processing of sequence representations. Experimental results show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL (Tr-XL) improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose memory processing, such as algorithmic tasks and reasoning.
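The data flow described in the abstract (memory tokens added to each segment's input, with the updated memory passed to the next segment) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: `toy_backbone` is a weight-free, unmasked self-attention pass standing in for the unchanged Transformer, `rmt_forward` is a hypothetical helper name, and the initial memory is set to zeros where a real model would presumably learn it. No training or BPTT is shown.

```python
import math
import random


def toy_backbone(tokens):
    """Stand-in for an unchanged Transformer: one self-attention pass.

    `tokens` is a list of equal-length float vectors. Every position attends
    to every position (no learned weights, no masking).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return out


def rmt_forward(sequence, segment_len, num_mem, dim, backbone=toy_backbone):
    """Process `sequence` segment by segment, carrying memory between segments."""
    # Assumption: zero-initialised memory; a real model would likely learn it.
    memory = [[0.0] * dim for _ in range(num_mem)]
    outputs = []
    for start in range(0, len(sequence), segment_len):
        segment = sequence[start:start + segment_len]
        # Read memory is placed before the segment, write memory after it.
        hidden = backbone(memory + segment + memory)
        outputs.extend(hidden[num_mem:num_mem + len(segment)])
        # The updated write-memory positions become the next segment's memory.
        memory = hidden[num_mem + len(segment):]
    return outputs, memory


random.seed(0)
dim, segment_len, num_mem = 4, 5, 2
sequence = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(12)]
outputs, final_memory = rmt_forward(sequence, segment_len, num_mem, dim)
print(len(outputs), len(final_memory))  # 12 2
```

The backbone only ever sees `segment_len + 2 * num_mem` positions at a time, so the cost per segment stays fixed while information from earlier segments can still reach later ones through the carried memory vectors.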

Paper Structure (13 sections, 4 equations, 9 figures, 5 tables)

Introduction
Related work
Recurrent Memory Transformer
Experiments
Results
Conclusions
Training details and additional results
Algorithmic tasks
Associative retrieval
Quadratic equations
Enwik8
WikiText-103
Operations with Memory

Figures (9)

Figure 1: Recurrent Memory Transformer. Memory is added as tokens to the input sequence and memory output is passed to the next segment. During training gradients flow from the current segment through memory to the previous segment.
Figure 2: Comparison of Recurrent Memory Transformer (RMT) and Transformer-XL architectures. RMT augments the Transformer with global memory tokens and passes them between segments to provide segment-level recurrence. Special read/write memory tokens are added to the input sequence. Multiple memory tokens can be used in each read/write block. Updated representations of the write memory are passed to the next segment. During training, RMT uses BPTT to propagate gradients to previous segments through the memory token representations. The effective context length for recurrence with memory is not limited by the depth of the network, as it is for the cache of Transformer-XL.
Figure 3: RMT outperforms Transformer-XL on the Copy and Reverse tasks as the number of segments increases. Panels show test-set per-character accuracy on the copy, reverse, and associative retrieval tasks (from left to right). Memory/cache size equals the length of a segment for both models. RMT does not pass gradients between segments in this experiment. MT results are the same as for the Baseline. Source/target sequence lengths for the copy, reverse, and associative retrieval tasks: 24/48, 24/24, 10/1.
Figure 4: RMT scales better with the number of segments and sequence size. (a) RMT solves the copy task perfectly with up to 9 segments for a fixed sequence length of 360, while Tr-XL fails. (b) RMT learns to use memory of the same fixed size (60 tokens) more effectively than Tr-XL as the length of the sequence to copy increases (the segment size is 120 for both models).
Figure 5: Deeper BPTT unrolling improves RMT scores on WikiText-103. (a) Visible context at training time can be increased by deeper BPTT unrolls for RMT or by enlarging the cache for Tr-XL. Larger visible context leads to lower perplexity for both models (marker size corresponds to memory size). (b) Recurrence improves performance of RMT compared to Tr-XL for the same memory sizes.
...and 4 more figures
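
To make the copy and reverse setups in the Figure 3 caption concrete, here is a small sketch of how such examples might be generated, split into segments, and scored per character. It rests on assumptions that the captions do not confirm: the 48-token copy target is taken to be the 24-token source written twice, the vocabulary is arbitrary, and `make_example`, `split_into_segments`, and `per_character_accuracy` are hypothetical helper names. Associative retrieval is left out because its format cannot be inferred from the lengths alone.

```python
import random


def make_example(task, source_len=24, vocab="abcdefghij", rng=None):
    """Build one (source, target) pair for a toy copy or reverse task."""
    rng = rng or random.Random(0)
    source = [rng.choice(vocab) for _ in range(source_len)]
    if task == "copy":
        # Assumption: the 48-token target is the 24-token source written twice.
        target = source + source
    elif task == "reverse":
        target = source[::-1]
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target


def split_into_segments(tokens, segment_len):
    """Cut a token list into consecutive segments of at most `segment_len`."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def per_character_accuracy(predicted, target):
    """Fraction of target positions where the prediction matches."""
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)


copy_src, copy_tgt = make_example("copy")
rev_src, rev_tgt = make_example("reverse")
segments = split_into_segments(copy_src + copy_tgt, segment_len=24)
print(len(copy_src), len(copy_tgt), len(rev_tgt), len(segments))  # 24 48 24 3
```

Splitting the concatenated source and target into more, shorter segments forces a model to carry the source across more segment boundaries, which is the regime the captions describe as harder for a cache-based model.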
