Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size

Naoki Takeshita, Masaaki Imaizumi

TL;DR

This work shows that column-symmetric polynomials on matrices can be universally approximated by Transformers with a single attention head, with a width that scales as $12\cdot(2d)^sN$ and depth $2sL+3s$, and an error that decays as $8^s\cdot N^{-L}$. The approach constructs monomial column-symmetric polynomials via rank-based decomposition and inductively builds higher-rank terms using a combination of feed-forward and attention layers, achieving parameter efficiency by keeping the number of parameters independent of the input column count $n$. The main contributions are a constructive proof, explicit architecture parameters, and detailed error analyses that establish how depth, width, and rank influence approximation quality for column-symmetric polynomials on matrix inputs. The results highlight the potential of deep Transformers for symmetry-aware function approximation with favorable parameter efficiency, and they discuss practical considerations, such as the impact of $d$, $s$, and positional encoding on scaling and applicability.

Abstract

Transformers are a type of neural network that has demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of transformers has garnered significant attention. A notable example is the mathematical analysis of their approximation power, which validates the empirical expressive capability of transformers. In this study, we investigate the ability of transformers to approximate column-symmetric polynomials, an extension of symmetric polynomials that take matrices as input. We establish an explicit relationship between the size of the transformer network and its approximation capability, leveraging the parameter efficiency of transformers and their compatibility with symmetry by focusing on the algebraic properties of symmetric polynomials.
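To make the notion of column symmetry concrete, the following minimal sketch checks numerically that a polynomial on a $d\times n$ matrix is unchanged when the columns are permuted. The polynomial `example_poly` (the average over columns of the product of the first two row entries) and the helper `permute_columns` are hypothetical illustrations chosen here, not objects defined in the paper.

```python
import random


def example_poly(X):
    """Hypothetical degree-2 column-symmetric polynomial (an assumption for
    illustration): f(X) = (1/n) * sum_j X[0][j] * X[1][j].
    It has positive coefficients and stays in [0, 1] on [0, 1]^(d x n)."""
    n = len(X[0])
    return sum(X[0][j] * X[1][j] for j in range(n)) / n


def permute_columns(X, perm):
    """Return a copy of X whose columns are reordered according to perm."""
    return [[row[j] for j in perm] for row in X]


def is_column_symmetric(f, X, trials=20, tol=1e-12, seed=0):
    """Check f(X) == f(X with shuffled columns) for several random shuffles."""
    rng = random.Random(seed)
    n = len(X[0])
    base = f(X)
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        if abs(f(permute_columns(X, perm)) - base) > tol:
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(1)
    d, n = 3, 5
    X = [[rng.random() for _ in range(n)] for _ in range(d)]
    print(is_column_symmetric(example_poly, X))
    # A function that reads one fixed column is generally not column-symmetric.
    print(is_column_symmetric(lambda M: M[0][0], X))
```

A random check of this kind can only refute symmetry, not prove it; for `example_poly` the invariance follows directly because the sum runs over all columns.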

Paper Structure

This paper contains 32 sections, 11 theorems, 72 equations, 7 figures, 1 table.

Key Result

Theorem 1

Let $f(\boldsymbol{X})$ be an arbitrary degree-$s$ column-symmetric polynomial over $[0,1]^{d\times n}$ with positive coefficients, satisfying $\lVert f\rVert_{L^{\infty}} \leq 1$, i.e., $\max_{\boldsymbol{X}\in [0,1]^{d\times n}} |f(\boldsymbol{X})| \leq 1$. Then, for any $N, L \in \mathbb{N}_{+}$, there exists a Transformer with width scaling as $12\cdot(2d)^sN$ and depth $2sL+3s$ that approximates $f$ with error decaying as $8^s\cdot N^{-L}$, and which has only a single attention head.
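As a rough aid for reading the result, the sketch below evaluates the width, depth, and error expressions quoted in the TL;DR ($12\cdot(2d)^sN$, $2sL+3s$, and $8^s\cdot N^{-L}$) for given $d$, $s$, $N$, $L$. The function name `transformer_size` is hypothetical, and the expressions are treated as exact values purely for illustration, whereas the summary only states how the quantities scale.

```python
def transformer_size(d: int, s: int, N: int, L: int) -> dict:
    """Evaluate the scaling expressions from the TL;DR (treated here as
    exact values, which is an assumption made for illustration).

    d: number of rows of the input matrix
    s: degree of the column-symmetric polynomial
    N, L: positive integers controlling width and depth
    """
    if min(d, s, N, L) < 1:
        raise ValueError("d, s, N and L must all be positive integers")
    return {
        "width": 12 * (2 * d) ** s * N,   # 12 * (2d)^s * N
        "depth": 2 * s * L + 3 * s,       # 2sL + 3s
        "error": 8 ** s / N ** L,         # 8^s * N^(-L)
    }


if __name__ == "__main__":
    # The column count n does not appear: none of the expressions depend on it.
    for L in (1, 2, 3, 4):
        print(L, transformer_size(d=2, s=2, N=8, L=L))
```

With $d=2$, $s=2$, $N=8$, the error expression only drops below 1 once $L \geq 3$, because the factor $8^s = 64$ must first be offset by $N^{-L}$; increasing $L$ adds depth but leaves the width unchanged.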

Figures (7)

  • Figure 1: An Example of a ReLU FNN with width $N=4$ and depth $L=2$.
  • Figure 2: Architecture of Transformers
  • Figure 3: The illustration of $m_{(1,1),(1,0)}(\boldsymbol{X})$
  • Figure 4: $T_1, T_2$ and $T_3$ are illustrated in blue, red and green respectively.
  • Figure 5: $\widetilde{f}_1(x)$ (in blue) and $\widetilde{f}_2(x)$ (in green) approximating the target function $x \mapsto x^2$ (in red)
  • ...and 2 more figures

Theorems & Definitions (26)

  • Definition 1: ReLU feed-forward neural network
  • Definition 2: Transformer
  • Definition 3
  • Definition 4
  • Theorem 1
  • Example 1
  • Example 2
  • Definition 5: Monomial symmetric polynomials
  • Definition 6: Monomial column-symmetric polynomials
  • Example 3
  • ...and 16 more