$\infty$-former: Infinite Memory Transformer

Pedro Henrique Martins, Zita Marinho, André F. T. Martins

TL;DR

The paper tackles long-context modeling in transformers by introducing ∞-former, which adds an unbounded long-term memory using continuous-space attention, decoupling attention cost from context length. It proposes unbounded memory via basis-function representation, two memory types (a long-term memory, LTM, and a short-term memory, STM), and sticky memories to preserve important information. Through synthetic sorting, GPT-2 fine-tuning on Wikitext-103/PG-19, and CMU-DoG experiments, it demonstrates improved long-range retention and perplexity/match metrics, especially on data with long-range dependencies. This approach offers a scalable path to long-context modeling with fixed compute, at the cost of a memory precision that depends on the number of basis functions N, and introduces practical mechanisms to prioritize relevant memories.

Abstract

Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length, trading off memory length with precision. In order to control where precision is more important, $\infty$-former maintains "sticky memories", which allows it to model arbitrarily long contexts while keeping the computation budget fixed. Experiments on a synthetic sorting task, language modeling, and document-grounded dialogue generation demonstrate the $\infty$-former's ability to retain information from long sequences.
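
To make the "memory length versus precision" trade-off concrete, here is a minimal sketch of how a sequence of any length could be compressed into a fixed number of basis-function coefficients. It is an illustration only, not the authors' implementation: the Gaussian radial basis functions, the ridge-regression fit, the evenly spaced positions on [0, 1], and the names `rbf_design`, `fit_memory`, and `read_memory` are all assumptions made for this example.

```python
import numpy as np


def rbf_design(positions, num_basis, width=None):
    """Evaluate num_basis Gaussian bumps at the given positions in [0, 1].

    Assumption: centers are evenly spaced and share one width.
    Returns an array of shape (len(positions), num_basis).
    """
    centers = np.linspace(0.0, 1.0, num_basis)
    if width is None:
        width = 1.0 / num_basis
    diff = positions[:, None] - centers[None, :]
    return np.exp(-0.5 * (diff / width) ** 2)


def fit_memory(X, num_basis, ridge=1e-3):
    """Compress X (length L, dimension d) into coefficients of shape (num_basis, d).

    The coefficient matrix has the same size for any L, which is why the
    cost of attending over it does not depend on the context length.
    """
    length = X.shape[0]
    t = np.linspace(0.0, 1.0, length)  # map token positions onto [0, 1]
    F = rbf_design(t, num_basis)  # (L, N)
    A = F.T @ F + ridge * np.eye(num_basis)  # regularized normal equations
    return np.linalg.solve(A, F.T @ X)  # (N, d)


def read_memory(B, positions):
    """Reconstruct the continuous signal at arbitrary positions in [0, 1]."""
    return rbf_design(positions, B.shape[0]) @ B
```

Under these assumptions, fitting a 500-token and a 2,000-token sequence with the same `num_basis` yields coefficient matrices of identical shape, while raising `num_basis` lowers the reconstruction error at the price of a larger memory representation.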

Paper Structure

This paper contains 35 sections, 22 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: $\infty$-former's attention diagram with sequence of text, $X_t$, of size $L=2$ and STM of size $L_\mathrm{STM}=2$. Circles represent input embeddings or hidden states (depending on the layer) for head $h$ and query $i$. Both the self-attention and the attention over the LTM are performed in parallel for each head and query.
  • Figure 2: Diagram of the unbounded memory update procedure. This is performed in parallel for each embedding dimension, and repeated throughout the input sequence. We propose two alternatives to select the positions used for the function evaluation: linearly spaced or sticky memories.
  • Figure 3: Left: Sorting task accuracy for sequences of length 4,000, 8,000, and 16,000. Right: Sorting task accuracy vs regression mean error, when varying the number of basis functions, for sequences of length 8,000.
  • Figure 4: Examples of answers generated by $\infty$-former on a dialogue about the movie "Home Alone". The excerpts from the LTM that receive the most attention during the generation of each utterance are highlighted in the corresponding color.
  • Figure 5: Phrases that hold larger spaces of the LTM, when using sticky memories, for two dialogue examples (given in the appendix).
  • ...and 7 more figures