Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?

Tokio Kajitsuka, Issei Sato

TL;DR

This work addresses the gap between the theoretical and practical expressivity of Transformers by interpreting softmax self-attention through the Boltzmann operator rather than as an approximation of hardmax. It shows that a single layer of softmax self-attention with rank-1 weight matrices can realize a contextual mapping that encodes the entire input sequence. Combined with two feed-forward networks, this one-layer architecture is a universal approximator of continuous permutation-equivariant functions on a compact domain, and a one-layer, single-head Transformer can memorize finite samples. The results bring theory closer to the Transformers used in practice and suggest that shallow architectures suffice for certain context-aware tasks, especially when a simple input-quantization step precedes attention. The analysis also extends to masked attention and leaves implications for formal languages and optimization dynamics to future work.

Abstract

Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
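
The link between softmax and the Boltzmann operator mentioned in the abstract can be made concrete with a small numerical sketch. The Boltzmann operator of a score vector $a$ is the softmax-weighted average of its own entries, $\operatorname{boltz}(a) = \sum_i a_i e^{a_i} / \sum_j e^{a_j}$, which is what a softmax attention head computes when the values being averaged coincide with the scores. The function names and the inverse-temperature parameter `beta` below are illustrative assumptions, not notation from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(scores, beta=1.0):
    """Boltzmann operator: average of `scores` weighted by softmax(beta * scores).

    `beta` is a hypothetical inverse-temperature knob added for illustration.
    """
    weights = softmax([beta * s for s in scores])
    return sum(w * s for w, s in zip(weights, scores))

scores = [0.5, 2.0, 1.0]
for beta in (0.0, 1.0, 10.0, 100.0):
    # beta = 0 gives the plain mean; large beta approaches max(scores).
    print(beta, boltzmann(scores, beta))
```

For finite `beta` the output depends on every entry of `scores`, not only on the largest one. In the hardmax limit (very large `beta`) that dependence disappears, which is why treating softmax as an approximation of hardmax discards information about the rest of the sequence.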

Paper Structure

This paper contains 19 sections, 12 theorems, 85 equations, and 1 figure.

Key Result

Theorem 1

$1$-layer multi-head self-attention $\mathcal{F}^{(SA)}_H$ with the hardmax function cannot be a contextual mapping.
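
One way to build intuition for this result (an illustration only, not the paper's proof) is a toy single-head example with scalar tokens. A contextual mapping, roughly speaking, must give the same token different outputs when it appears in different contexts. The sketch below assumes identity query, key, and value weights, a residual connection, and no ties in the scores. All function names are hypothetical.

```python
import math

def hardmax_attention(tokens):
    """Each position attends only to the token with the highest score q * k."""
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        best = tokens[scores.index(max(scores))]  # assumes a unique maximum
        out.append(q + best)  # residual connection plus attended value
    return out

def softmax_attention(tokens):
    """Each position attends to all tokens with softmax weights over q * k."""
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        attended = sum(e / total * k for e, k in zip(exps, tokens))
        out.append(q + attended)
    return out

# Two sequences that share their first two tokens but differ in the third.
x1 = [1.0, 3.0, 2.0]
x2 = [1.0, 3.0, 0.5]

# Hardmax: the first token selects 3.0 in both sequences, so its output
# cannot tell the two contexts apart.
print(hardmax_attention(x1)[0], hardmax_attention(x2)[0])

# Softmax: the third token still contributes, so the outputs differ.
print(softmax_attention(x1)[0], softmax_attention(x2)[0])
```

In this example the hardmax head ignores every token except the one it selects, so a change elsewhere in the sequence leaves the output for the first token unchanged. The softmax head stays sensitive to that change.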

Figures (1)

  • Figure 1: CoNLL-2003

Theorems & Definitions (29)

  • Definition 1: Tokenwise Separatedness
  • Definition 2: Contextual Mapping
  • Theorem 1
  • Theorem 2
  • Proof: Proof Overview
  • Lemma 1
  • Remark 1: Masked self-attention
  • Corollary 1: Memorization capacity of one-layer Transformer
  • Remark 2: Parameter efficiency
  • Corollary 2: Memorization capacity of one-layer Transformer with positional encodings
  • ...and 19 more