On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
TL;DR
This work studies the computational limits of Visual AutoRegressive (VAR) models for image generation through a fine-grained complexity lens. Assuming the Strong Exponential Time Hypothesis (SETH), it proves a lower bound that rules out sub-quartic time when the attention input range $R$ satisfies $R=\Theta(\sqrt{\log n})$ with $d=O(\log n)$. Conversely, when $R=o(\sqrt{\log n})$, low-rank approximations give an almost-quadratic $O(n^{2+o(1)})$-time approximation algorithm, and the paper provides explicit constructions that meet this criterion. The paper dissects VAR into three stages (token-map generation, feature-map reconstruction, and VQ-VAE decoding) and analyzes the running time and error propagation of fast variants of the VAR Transformer and the VQ-VAE Decoder, establishing provable efficiency under practical assumptions. Together, these results give a theoretical framework for designing scalable autoregressive visual generators and delineate the fundamental limits of attention-based speedups in this domain, which may inform future hardware-aware optimizations and extensions to text-to-image and multi-modal synthesis.
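To make the low-rank idea concrete, the following is a minimal sketch of a Taylor-polynomial approximation to softmax attention. It is not the authors' construction. The function names, the $1/d$ scaling inside the exponential, and the choice of a degree-4 polynomial are all illustrative assumptions. It shows that when the inputs are bounded by a small $R$, the matrix $\exp(QK^\top/d)$ is well approximated by a product $U_1 U_2^\top$ of thin matrices, so the attention output can be computed without ever forming the $n \times n$ matrix.

```python
import math
import numpy as np


def taylor_features(X, degree):
    """Map each row x to the concatenation of x^{(tensor) j} / sqrt(j!), j = 0..degree.

    The inner product of two such feature rows equals sum_j (x.y)^j / j!,
    i.e. the degree-`degree` Taylor polynomial of exp(x.y).
    """
    n = X.shape[0]
    power = np.ones((n, 1))              # j = 0 term
    blocks = [power]
    for j in range(1, degree + 1):
        # Row-wise tensor product: x^{(tensor) j} = x^{(tensor) (j-1)} (tensor) x.
        power = np.einsum("ni,nj->nij", power, X).reshape(n, -1)
        blocks.append(power / math.sqrt(math.factorial(j)))
    return np.concatenate(blocks, axis=1)


def exact_attention(Q, K, V):
    """Reference implementation that forms the full n x n matrix."""
    d = Q.shape[1]
    A = np.exp(Q @ K.T / d)              # assumed scaling: exp(Q K^T / d)
    return (A @ V) / A.sum(axis=1, keepdims=True)


def low_rank_attention(Q, K, V, degree=4):
    """Approximate attention using the factorization exp(QK^T/d) ~ U1 @ U2.T."""
    d = Q.shape[1]
    U1 = taylor_features(Q / d, degree)  # n x r
    U2 = taylor_features(K, degree)      # n x r
    # Multiply right-to-left so no n x n matrix is ever materialized.
    numer = U1 @ (U2.T @ V)
    denom = U1 @ U2.sum(axis=0)
    return numer / denom[:, None]


def max_error(R, n=256, d=3, seed=0):
    """Largest entrywise gap between exact and low-rank attention for inputs in [-R, R]."""
    rng = np.random.default_rng(seed)
    Q, K, V = (rng.uniform(-R, R, size=(n, d)) for _ in range(3))
    return np.abs(exact_attention(Q, K, V) - low_rank_attention(Q, K, V)).max()
```

In this sketch the feature dimension is $r = \sum_{j=0}^{g} d^j$ for polynomial degree $g$, so the factorization is only cheap when both $d$ and the required degree are small. Larger entries need a higher degree for the same accuracy, which loosely mirrors why the bound $R$ on the inputs governs whether a fast algorithm exists. Calling `max_error(0.5)` and `max_error(3.0)` compares the two regimes empirically.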
Abstract
Recently, Visual Autoregressive ($\mathsf{VAR}$) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine ``next-scale prediction'' paradigm. Let $n$ denote the height and width of the last VQ code map generated by $\mathsf{VAR}$ models. The state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] then takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of $\mathsf{VAR}$ Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve almost quadratic time complexity, $O(n^{2+o(1)})$. Outside these conditions, we prove that, assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the $\mathsf{VAR}$ model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.
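The quartic baseline can be motivated by a short token count. The derivation below is a sketch under an illustrative assumption that is not stated in the abstract: the token maps have side lengths $1, 2, 4, \ldots, 2^{K-1} = n$. The actual scale schedule used by $\mathsf{VAR}$ may differ. Here $L$ is the total number of tokens across all scales and $d$ is the hidden dimension.

```latex
% Assumption (illustrative): K scales with side lengths 1, 2, 4, ..., 2^{K-1} = n.
\begin{align*}
L &= \sum_{k=0}^{K-1} \bigl(2^{k}\bigr)^{2}
   = \frac{4^{K}-1}{3}
   = \frac{4n^{2}-1}{3}
   = \Theta(n^{2}), \\
\text{exact attention cost} &= O(L^{2} d)
   = O(n^{4}\log n)
   = O(n^{4+o(1)}) \quad \text{for } d = O(\log n).
\end{align*}
```

Under this assumption, the final scale alone contributes $n^{2}$ queries that attend to $\Theta(n^{2})$ tokens, so any method that materializes the attention matrix exactly already spends on the order of $n^{4}$ time.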
