A unified framework for establishing the universal approximation of transformer-type architectures
Jingpu Cheng, Ting Lin, Zuowei Shen, Qianxiao Li
TL;DR
The paper presents a unified, verifiable framework for the universal approximation property (UAP) of transformer-type architectures, covering kernel-based, sparse, and other attention mechanisms in a single analysis. Central to the theory are a nonlinear, affine-invariant feedforward family and a token-mixing mechanism that can distinguish tokens under a permutation group $G$; when these conditions hold, the transformer family achieves $G$-UAP in $L^p(K)$ for compact sets $K$. A key technical advance is reducing the token-distinguishability check to a two-sample test under an analytic parameterization, which makes the framework applicable to diverse attention forms. The framework informs principled design of symmetry-aware and novel attention mechanisms (including those with bias terms and particular $D_n$ or $C_n$ symmetries) with guaranteed UAP, providing a non-constructive yet verifiable foundation for expressive transformer architectures.
Abstract
We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we significantly simplify the verification of this condition, providing a non-constructive approach to establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.
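
To make the role of token distinguishability concrete, the following is a minimal illustrative sketch and is not taken from the paper. It assumes standard single-head softmax self-attention with no positional encoding; the function name `softmax_attention` and the weight names `Wq`, `Wk`, `Wv` are hypothetical. The check shows that such a layer commutes with every permutation of the tokens, so on its own it cannot treat tokens differently according to their position. This is one way to see why the symmetry group $G$ under which tokens can be distinguished matters for what the architecture can approximate.

```python
import numpy as np

def softmax_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention on an (n_tokens, d) array.

    Assumption for this sketch: no positional encoding, no bias terms.
    """
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ (X @ Wv)

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Permute the token order, then compare with permuting the output instead.
perm = rng.permutation(n)
out = softmax_attention(X, Wq, Wk, Wv)
out_perm = softmax_attention(X[perm], Wq, Wk, Wv)

# Equivariance: attention(P X) equals P attention(X) up to floating-point error.
equivariant = np.allclose(out_perm, out[perm])
print("permutation-equivariant:", equivariant)
```

Adding position-dependent terms (for example a bias in the attention scores) breaks this symmetry for some permutations, which is the kind of design choice the abstract refers to when it mentions architectures with specific functional symmetries.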
