On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions

Linyan Gu; Lihua Yang; Feng Zhou

TL;DR

This paper gives an explicit approximation of maxout networks by Transformer networks with comparable model complexity. It also develops a framework for analyzing how Transformers approximate continuous piecewise linear functions, and quantifies their expressivity by the number of linear regions, which grows exponentially with depth.

Abstract

Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework to analyze the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity via the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between approximation theory for standard feedforward neural networks and Transformer architectures. It also yields structural insights into Transformers: self-attention layers implement max-type operations, while feedforward layers realize token-wise affine transformations.
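
The structural claim that self-attention layers implement max-type operations can be illustrated numerically. The sketch below is not the paper's construction: it only shows that a softmax-weighted average with a large scaling factor approaches the maximum of its inputs, and that a ReLU is a maxout unit with two affine pieces. The function names (`softmax`, `soft_max_value`, `maxout_unit`) and the parameter `lam` are hypothetical, chosen for this illustration.

```python
import math


def softmax(scores, lam):
    """Softmax of lam * scores, stabilized by subtracting the largest score."""
    top = max(scores)
    exps = [math.exp(lam * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_max_value(values, lam):
    """Softmax-weighted average of `values`, using the values as their own scores.

    As lam grows, the weights concentrate on the largest entry, so the
    result approaches max(values) (the hardmax limit).
    """
    weights = softmax(values, lam)
    return sum(w * v for w, v in zip(weights, values))


def maxout_unit(x, weights, biases):
    """Exact maxout unit: the maximum of the affine maps w_k . x + b_k."""
    return max(
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


if __name__ == "__main__":
    values = [0.3, -1.2, 0.9]
    for lam in (1.0, 10.0, 100.0):
        gap = max(values) - soft_max_value(values, lam)
        print(f"lam={lam:>6}: gap to exact max = {gap:.3e}")

    # ReLU(t) = max(t, 0) written as a maxout unit with two pieces.
    relu_weights, relu_biases = [[1.0], [0.0]], [0.0, 0.0]
    print(maxout_unit([-2.0], relu_weights, relu_biases))
    print(maxout_unit([3.0], relu_weights, relu_biases))
```

In this toy setting the gap to the exact maximum shrinks as `lam` increases, which is the qualitative behavior behind replacing hardmax with a scaled softmax.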

Paper Structure (24 sections, 15 theorems, 140 equations)

Introduction
Related Works
Expressive capacity of feedforward neural networks.
Expressive capacity of Transformers.
Organization of This Paper
Preliminaries
Notations
Transformer Networks
Transformer Block.
Transformer Network.
Positional Embedding.
Auxiliary Token.
Approximation for Maxout Networks by Transformers
Proof Sketch.
Step 1: Approximating the affine maps.
...and 9 more sections

Key Result

Theorem 3.1

Let $p \le T$, let $\Omega \subset \mathbb{R}^{n \times T}$ be compact, and let $f \in \mathcal{T}_{\mathrm{max}}(n \times T, p, m \times T)$. Then, there exists a hardmax-based Transformer network such that Moreover, for any $\epsilon > 0$, the corresponding softmax-based Transformer $\mathcal{N}_S^\lambda$ satisfies provided that the scaling parameter satisfies $\lambda = \mathcal{O}(1/\epsil
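
To see why a scaling parameter of order $1/\epsilon$ is plausible, the following standard bound on a scaled softmax may help. It is not taken from the paper, whose exact statement and constants are cut off above; the symbols $z$, $k$, and $\sigma_\lambda$ are introduced here for illustration only.

```latex
% Assumption: z \in \mathbb{R}^k is any vector of k real scores and \lambda > 0.
\[
  \sigma_\lambda(z)_i = \frac{\exp(\lambda z_i)}{\sum_{j=1}^{k} \exp(\lambda z_j)},
  \qquad
  0 \;\le\; \max_{1 \le i \le k} z_i \;-\; \sum_{i=1}^{k} \sigma_\lambda(z)_i \, z_i
  \;\le\; \frac{\log k}{\lambda}.
\]
% Hence \lambda \ge (\log k)/\epsilon makes the error of a single soft maximum at most \epsilon.
```

The upper bound follows from the identity $\frac{1}{\lambda}\log\sum_j \exp(\lambda z_j) = \sum_i \sigma_\lambda(z)_i z_i + \frac{1}{\lambda} H(\sigma_\lambda(z))$, where $H$ is the Shannon entropy and $H \le \log k$, together with $\max_i z_i \le \frac{1}{\lambda}\log\sum_j \exp(\lambda z_j)$.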

Theorems & Definitions (28)

Theorem 3.1: Approximation of a Maxout Layer with $p \le T$
Remark 3.1
Remark 3.2
Remark 3.3
Theorem 3.2: Approximation of Deep Maxout Networks with $p \le T$
Remark 3.4
Corollary 3.3: Universal Approximation of ReLU Networks
Theorem 3.4: Universal Approximation of Shallow Maxout Networks
Theorem 3.5: Universal Approximation of Deep Maxout Networks
Remark 3.5
...and 18 more
