Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?

Tokio Kajitsuka; Issei Sato

TL;DR

This work narrows the gap between the theoretical expressivity of Transformers and the shallow architectures used in practice. By interpreting softmax self-attention through the Boltzmann operator, rather than as an approximation of hardmax, it shows that a single softmax self-attention layer with rank-1 weight matrices can realize a contextual mapping, that is, a map that captures the context of the entire input sequence. Combined with two feed-forward networks, this one attention layer yields a universal approximator for continuous permutation-equivariant functions on a compact domain, and a one-layer, single-head Transformer can memorize finite samples. The results bring theoretical guarantees closer to practical Transformer designs and suggest that modest-depth architectures suffice for certain context-aware tasks, especially when a simple input-quantization step precedes attention. The analysis also extends to masked self-attention; implications for formal languages and optimization dynamics are left to future work.

Abstract

Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
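
The link between softmax attention and the Boltzmann operator mentioned above can be sketched numerically. The Boltzmann operator maps a vector $a$ to the softmax-weighted average of its own entries, $\sum_i a_i e^{a_i} / \sum_j e^{a_j}$. The snippet below is a minimal illustration, not the paper's construction: the toy input `X` and the directions `u` and `v` are hypothetical, and it assumes that the value projection reuses the key direction `v`, so that the rank-1 scores for one query token are `t * a` with `a = v @ X`.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax of a 1-D array."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

def boltz(a):
    """Boltzmann operator: softmax-weighted average of the entries of a."""
    a = np.asarray(a, dtype=float)
    return float(softmax(a) @ a)

# Hypothetical toy input: d = 2 features, n = 3 tokens (one token per column).
X = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, -1.0]])
u = np.array([1.0, 0.5])    # assumed query direction
v = np.array([1.0, -1.0])   # assumed shared key/value direction

a = v @ X            # projected tokens, shape (n,)
t = u @ X[:, 0]      # scalar query projection of token 0

# Rank-1 attention for token 0: scores are t * a, values are a.
attn_out = float(softmax(t * a) @ a)

# The same number, written with the Boltzmann operator (valid for t != 0).
via_boltz = boltz(t * a) / t
```

Under these assumptions `attn_out` and `via_boltz` agree, and the result depends on every entry of `a` rather than only on its largest one, which is the property that a hardmax reading of attention discards.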

Paper Structure (19 sections, 12 theorems, 85 equations, 1 figure)

Introduction
Related Works
Preliminaries
Notation
Transformer block
Attention is a Contextual Mapping
Problem setting
Background
Self-attention with hardmax
Self-attention with softmax
Applications of Contextual Mapping
Memorization capacity of one-layer Transformer
Transformers with one self-attention layer are universal approximators
Experiments
Conclusions
...and 4 more sections

Key Result

Theorem 1

$1$-layer multi-head self-attention $\mathcal{F}^{(SA)}_H$ with the hardmax function cannot be a contextual mapping.
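
A small numerical illustration of the intuition behind this statement, not a proof: with hardmax, each token attends only to the single highest-scoring token, so two sequences that share that token can produce identical outputs even though their contexts differ. The sketch assumes scalar tokens, a single head, scalar weights `w` and `v` (hypothetical names), and a residual connection; softmax attention with the same weights separates the two sequences.

```python
import numpy as np

def hardmax_attention(x, w=1.0, v=1.0):
    """Single-head attention on scalar tokens with hardmax weights."""
    x = np.asarray(x, dtype=float)
    scores = w * np.outer(x, x)          # scores[k, i] = w * x_k * x_i
    winners = scores.argmax(axis=1)      # each query keeps only its top key
    return x + v * x[winners]            # residual plus attended value

def softmax_attention(x, w=1.0, v=1.0):
    """Same layer, but with softmax weights over all keys."""
    x = np.asarray(x, dtype=float)
    scores = w * np.outer(x, x)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return x + v * (weights @ x)

# Two sequences that differ only in the middle token.
seq_a = [1.0, 2.0, 5.0]
seq_b = [1.0, 3.0, 5.0]

hard_a, hard_b = hardmax_attention(seq_a), hardmax_attention(seq_b)
soft_a, soft_b = softmax_attention(seq_a), softmax_attention(seq_b)

# Hardmax: the first token sees only the largest token (5.0) in both sequences.
same_under_hardmax = bool(hard_a[0] == hard_b[0])
# Softmax: the middle token contributes to the weighted average.
differs_under_softmax = not np.isclose(soft_a[0], soft_b[0])
```

For this particular pair, the first token's hardmax output is the same in both sequences while its softmax output is not. Flipping the sign of `w` makes hardmax attend to the smallest token instead, which the two sequences also share.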

Figures (1)

Figure 1: CoNLL-2003

Theorems & Definitions (29)

Definition 1: Tokenwise Separatedness
Definition 2: Contextual Mapping
Theorem 1
Theorem 2
Proof Overview
Lemma 1
Remark 1: Masked self-attention
Corollary 1: Memorization capacity of one-layer Transformer
Remark 2: Parameter efficiency
Corollary 2: Memorization capacity of one-layer Transformer with positional encodings
...and 19 more
