Attention is a smoothed cubic spline

Zehua Lai, Lek-Heng Lim, Yucong Liu

TL;DR

The paper recasts transformers in the language of spline theory. It shows that a ReLU-activated attention module is a cubic spline and that every larger component of a transformer, being a composition of attention modules and feed-forward networks, is a cubic or higher-order spline. Replacing ReLU with SoftMax smooths these splines, which are generally only $C^2$, into $C^\infty$ functions and recovers the original design of Vaswani et al. Conversely, assuming the Pierce–Birkhoff (PB) conjecture, every spline is a ReLU-activated encoder. Using Veronese maps, the authors make the correspondence between splines and encoders/decoders run in both directions, giving a unified mathematical framework in which added depth corresponds to increased spline degree. The work also discusses practical implications, suggesting SoftPlus as a smoother alternative activation and highlighting the PB conjecture as a link between classical approximation theory and modern deep learning. Overall, the paper presents transformers as natural spline constructions tied to a longstanding conjecture, a view that may guide the design of smoother, more efficient variants.

Abstract

We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-decoder attention are all cubic splines. As every component in a transformer is constructed out of compositions of various attention modules (= cubic splines) and feed forward neural networks (= linear splines), all its components -- encoder, decoder, and encoder-decoder blocks; multilayered encoders and decoders; the transformer itself -- are cubic or higher-order splines. If we assume the Pierce-Birkhoff conjecture, then the converse also holds, i.e., every spline is a ReLU-activated encoder. Since a spline is generally just $C^2$, one way to obtain a smoothed $C^\infty$-version is by replacing ReLU with a smooth activation; and if this activation is chosen to be SoftMax, we recover the original transformer as proposed by Vaswani et al. This insight sheds light on the nature of the transformer by casting it entirely in terms of splines, one of the best known and thoroughly understood objects in applied mathematics.
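As a rough illustration of the cubic-spline claim, the sketch below evaluates a ReLU-activated attention module along a line in input space and checks that the result is piecewise cubic. The conventions are assumptions made for this example and are not the paper's notation: tokens are the rows of `X`, there are no biases or scaling factors, and the 2×2 weights `W_Q`, `W_K`, `W_V` are hand-picked so that the pieces meet at $t = \pm 1$. The check relies on the fact that the fourth finite difference of a cubic polynomial is zero.

```python
# Minimal sketch (illustrative conventions, not the paper's notation):
# ReLU attention computed as ReLU(Q K^T) V with Q = X W_Q, K = X W_K, V = X W_V.

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def relu(A):
    """Entrywise ReLU."""
    return [[max(0.0, a) for a in row] for row in A]

def relu_attention(X, W_q, W_k, W_v):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scores = relu(matmul(Q, transpose(K)))  # piecewise quadratic in X
    return matmul(scores, V)                # piecewise cubic in X

# Hand-picked weights, chosen only to make the pieces easy to see.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [-1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 1.0]]

def g(t):
    """One output entry along the line X(t) = [[1, t], [t, 1]]."""
    X = [[1.0, t], [t, 1.0]]
    return relu_attention(X, W_Q, W_K, W_V)[1][1]

def fourth_difference(f, t, h):
    """Central fourth finite difference; zero for any cubic polynomial."""
    return f(t - 2 * h) - 4 * f(t - h) + 6 * f(t) - 4 * f(t + h) + f(t + 2 * h)

inside = fourth_difference(g, 0.0, 0.2)   # stencil stays within one piece
across = fourth_difference(g, 1.0, 0.2)   # stencil straddles the break at t = 1
print(inside, across)
```

With these weights the traced entry equals $(1 - t^2)(1 + t)$ for $|t| < 1$ and $0$ elsewhere, so the fourth difference vanishes (up to rounding) inside a piece and is clearly nonzero across the break at $t = 1$.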

Paper Structure (25 sections, 10 theorems, 70 equations, 2 figures)

Key Result

Theorem 3.1

Every ReLU-activated neural network $\varphi : \mathbb{R}^n \to \mathbb{R}$ is a linear spline, and every linear spline $\ell : \mathbb{R}^n \to \mathbb{R}$ can be represented by a ReLU-activated neural network of depth at most $\lceil\log_2(n+1)\rceil+1$.
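A small worked instance may make the theorem concrete. The sketch below writes the linear spline $\max(x_1, x_2)$ on $\mathbb{R}^2$ as a ReLU network with one hidden layer, using the identity $\max(x_1, x_2) = \mathrm{ReLU}(x_1 - x_2) + \mathrm{ReLU}(x_2) - \mathrm{ReLU}(-x_2)$, and evaluates the depth bound from the statement. The function names `max_net` and `depth_bound` are hypothetical, and counting depth as the number of hidden layers plus the output layer is an assumption of this example.

```python
import math

def relu(z):
    return max(0.0, z)

def max_net(x1, x2):
    """max(x1, x2) as a ReLU network: one hidden layer, then an affine output."""
    # Hidden layer: three ReLU units applied to affine functions of the input.
    h1 = relu(x1 - x2)
    h2 = relu(x2)
    h3 = relu(-x2)
    # Output layer: h2 - h3 reconstructs x2, and h1 adds the excess of x1 over x2.
    return h1 + h2 - h3

def depth_bound(n):
    """Depth bound ceil(log2(n + 1)) + 1 from the theorem, for inputs in R^n."""
    return math.ceil(math.log2(n + 1)) + 1

print(max_net(3.0, -1.0), max_net(-2.0, 5.0))
print([depth_bound(n) for n in (1, 2, 3, 4)])
```

Under the stated depth convention, `max_net` has depth 2, which is within the bound `depth_bound(2)` for $n = 2$.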

Figures (2)

  • Figure 1: Transformer as flow chart.
  • Figure 2: Attention module as flow chart.

Theorems & Definitions (21)

  • Definition 2.1: Partition
  • Definition 2.2: Spline
  • Conjecture 2.3: Pierce–Birkhoff
  • Theorem 3.1: Arora–Basu–Mianjy–Mukherjee
  • Lemma 3.2
  • proof
  • Theorem 3.3: Components of a transformer as splines
  • proof
  • Lemma 3.4
  • proof
  • ...and 11 more