Universal Approximation of Visual Autoregressive Transformers
Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
TL;DR
The paper asks whether Visual Autoregressive (VAR) transformers and FlowAR models are universal approximators. It develops a formal theory showing that even a minimal configuration (a single self-attention layer with up-sampling) can approximate any Lipschitz image-to-image transformation, and it extends the result to FlowAR with similar guarantees. The authors introduce contextual mapping for attention, analyze how attention interacts with up-sampling, and give perturbation-based proofs that yield explicit error bounds. These results connect transformer universality with VAR and flow-based architectures and offer design principles for efficient, scalable image generation. In practice, the findings justify coarse-to-fine VAR and FlowAR strategies and can guide future architectural choices in generative vision systems.
Abstract
We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine "next-scale prediction" framework. These models set a new quality bar for image synthesis, outperforming all previous methods, including Diffusion Transformers. Our primary contribution establishes that a single-head VAR transformer with a single self-attention layer and a single interpolation layer is universal. From a statistical perspective, we prove that such simple VAR transformers are universal approximators for any Lipschitz image-to-image function. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide design principles for effective and computationally efficient VAR transformer strategies, which can be extended to more sophisticated VAR models in image generation and related areas.
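
To make the minimal configuration named in the abstract concrete, here is a rough sketch of one interpolation layer followed by one single-head self-attention layer acting on a small token map. It is an illustration under stated assumptions, not the paper's construction: the function names (`upsample_nearest`, `self_attention`, `var_block`), the order of the two layers, the nearest-neighbour interpolation, the 1/sqrt(d) score scaling, and the random weights are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over n tokens of dimension d; X has shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Assumption: standard 1/sqrt(d) scaling of the attention scores.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def upsample_nearest(token_map, factor):
    """Nearest-neighbour up-sampling of an (h, w, c) token map by an integer factor.

    Assumption: a stand-in for the interpolation layer; other schemes
    (e.g. bilinear or bicubic) would slot in here instead.
    """
    return token_map.repeat(factor, axis=0).repeat(factor, axis=1)

def var_block(token_map, W_q, W_k, W_v, factor=2):
    """Hypothetical coarse-to-fine step: up-sample, flatten to tokens, attend."""
    h, w, c = token_map.shape
    up = upsample_nearest(token_map, factor)   # (h*factor, w*factor, c)
    tokens = up.reshape(-1, c)                 # one token per spatial position
    out = self_attention(tokens, W_q, W_k, W_v)
    return out.reshape(h * factor, w * factor, -1)

rng = np.random.default_rng(0)
c = 4                                          # channel / embedding dimension
coarse = rng.standard_normal((2, 2, c))        # a 2x2 coarse-scale token map
W_q, W_k, W_v = (rng.standard_normal((c, c)) for _ in range(3))

fine = var_block(coarse, W_q, W_k, W_v, factor=2)
print(fine.shape)  # (4, 4, 4)
```

The universality result concerns what maps of this general shape can approximate as the weights vary; the sketch only shows the data flow from a coarse scale to the next finer one and makes no claim about approximation error.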
