On the Optimal Memorization Capacity of Transformers

Tokio Kajitsuka; Issei Sato

TL;DR

The paper analyzes the memorization capacity of Transformer architectures under two settings: next-token prediction and sequence-to-sequence prediction. It shows that memorization can be achieved with $\tilde{O}(\sqrt{N})$ parameters in the next-token setting (nearly independent of the input length $n$) and with $\tilde{O}(\sqrt{nN})$ parameters in the seq-to-seq setting under hardmax, with matching lower bounds up to logarithmic factors. The core technique uses a contextual mapping to assign unique sequence IDs and demonstrates that a single self-attention layer can effectively identify input sequences, while the feed-forward network becomes the bottleneck for assigning labels in the seq-to-seq case. These results highlight the parameter efficiency that self-attention gains from reusing information across tokens, and they suggest that the practical advantages of Transformers may stem from optimization and generalization properties rather than raw memorization capacity. The work also discusses bit-complexity considerations and open problems, namely memorization bounds for softmax and extensions to other equivariant architectures.
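To make the phrase "assign unique sequence IDs" concrete, here is a minimal toy sketch in Python. It is not the paper's construction: it uses a plain dictionary instead of a self-attention layer, and it assumes that sequences count as identical when they contain the same tokens up to reordering. The names `sequence_ids` and `memorize` are hypothetical and introduced only for illustration.

```python
from collections import Counter

def sequence_ids(sequences):
    """Assign one integer ID per distinct sequence.

    Assumption for this sketch: a sequence is identified by its multiset
    of tokens, so reordering the tokens does not change the ID.
    """
    table = {}
    ids = []
    for seq in sequences:
        key = frozenset(Counter(seq).items())  # hashable multiset of tokens
        ids.append(table.setdefault(key, len(table)))
    return ids

def memorize(sequences, labels):
    """Build an ID -> label lookup; raise if the labels are inconsistent."""
    lookup = {}
    for seq_id, label in zip(sequence_ids(sequences), labels):
        if lookup.setdefault(seq_id, label) != label:
            raise ValueError("same sequence carries two different labels")
    return lookup

data = [("a", "b"), ("b", "a"), ("a", "a")]
ids = sequence_ids(data)
lookup = memorize(data, [1, 1, 0])
```

Once every sequence has a distinct ID, memorizing labels reduces to fitting a function from IDs to labels, which is the role the summary attributes to the feed-forward part of the network.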

Abstract

Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.
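The gap between the two settings can be illustrated numerically. The sketch below simply evaluates $\sqrt{N}$ and $\sqrt{nN}$, dropping constants and the logarithmic factors hidden by $\tilde{O}$, so the numbers are orders of magnitude rather than actual parameter counts. The function names and the sample values of $N$ and $n$ are assumptions chosen for illustration.

```python
import math

def next_token_scale(num_sequences: int) -> float:
    """sqrt(N): next-token setting, constants and log factors dropped."""
    return math.sqrt(num_sequences)

def seq_to_seq_scale(num_sequences: int, seq_len: int) -> float:
    """sqrt(n * N): seq-to-seq setting, constants and log factors dropped."""
    return math.sqrt(seq_len * num_sequences)

N = 10_000  # illustrative number of input sequences
rows = []
for n in (16, 256):  # illustrative input lengths
    rows.append((n, next_token_scale(N), seq_to_seq_scale(N, n)))

for n, nt, s2s in rows:
    print(f"n={n:4d}  next-token ~ {nt:.0f}  seq-to-seq ~ {s2s:.0f}")
```

In this simplified view the next-token scale does not change with $n$, while the seq-to-seq scale grows by a factor of $\sqrt{n}$, which mirrors the abstract's point that labeling every token is the more expensive task.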

Paper Structure (28 sections, 23 theorems, 154 equations, 5 figures, 1 table)

Introduction
Related Work
Preliminaries
Notation
Transformer block
Bit complexity
Memorization Capacity of Transformers
Problem setting
Next-token prediction setting
Upper bound
Proof outline of Theorem 4.1
Lower bound
Sequence-to-sequence prediction setting
Conclusions
Definition of Multisets
...and 13 more sections

Key Result

Theorem 4.1

Let $({\bm{X}}^{(1)}, y^{(1)}),\dots,({\bm{X}}^{(N)}, y^{(N)}) \in \mathbb{R}^{d \times n} \times [C]$ be a sequence of input-label pairs such that […]. Then, there exists a Transformer $\mathcal{N}:\mathbb{R}^{d \times n} \to \mathbb{R}^n$ with width $14$ and depth $\tilde{O}(\sqrt{N})$ that memorizes the dataset, that is, […] holds for every $i \in [N]$, as long as $n,C,r\delta^{-1} = N^{O(1)}$ as $N \to \infty$.

Figures (5)

Figure 1: Training loss on MultiNLI dataset
Figure 2: Accuracy on MultiNLI dataset
Figure 4: Training loss on IMDb dataset
Figure 5: Accuracy on IMDb dataset
Figure 7: Memorization capacity, that is, the minimum size of Transformers required for memorizing the MultiNLI dataset with size $N=600,\dots,1700$ in increments of $100$. In this figure, the depth $\# \mathrm{blocks}$ of the two token-wise feed-forward networks $\mathcal{F}^{(\mathrm{FF})}_1$ and $\mathcal{F}^{(\mathrm{FF})}_2$ in \ref{eq:model_for_experiment} is used as the variable on the vertical axis to control the size of the network. Each model was trained using full-batch gradient descent for $1000$ epochs, and the best-performing model was selected after running ten trials of hyperparameter tuning with Optuna.

Theorems & Definitions (47)

Remark 3.1
Theorem 4.1: Next-token prediction
Remark 4.1: Deep sets
Remark 4.2: Embedding layer
Remark 4.3: Dependence on $d$
Definition 4.1: Contextual mapping
Remark 4.4: Optimality in terms of bit counts
Theorem 4.2: Lower bound
Corollary 4.1: Seq-to-seq prediction
Remark 4.5: Sparse Transformers
...and 37 more
