On the Universality of Transformer Architectures; How Much Attention Is Enough?

Amirreza Abbasi, Mohsen Hooshmand

TL;DR

The paper surveys Transformer expressiveness through an approximation-theoretic lens, synthesizing universality results for vanilla, sparse, and efficient variants. It shows that universality is robust to many architectural constraints as long as global information flow is preserved, and discusses minimal architectures, approximation rates, high-dimensional input regimes, and prompting-based universality. Practically, it offers guidance for choosing architecture variants that balance memory and computation with expressive power. It also outlines open questions on learnability, training dynamics, and rates beyond classical smooth-function settings.

Abstract

Transformers are crucial across many AI fields, such as large language models, computer vision, and reinforcement learning. This prominence stems from the architecture's perceived universality and scalability compared to alternatives. This work examines the problem of universality in Transformers, reviews recent progress, including architectural refinements such as structural minimality and approximation rates, and surveys state-of-the-art advances that inform both theoretical and practical understanding. Our aim is to clarify what is currently known about Transformer expressiveness, separate robust guarantees from fragile ones, and identify key directions for future theoretical research.

Paper Structure

This paper contains 14 sections, 2 equations, 1 table.