On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

Yekun Ke; Xiaoyu Li; Yingyu Liang; Zhizhou Sha; Zhenmei Shi; Zhao Song

TL;DR

This work investigates the computational limits of Visual AutoRegressive (VAR) models for image generation through a fine-grained complexity lens. Assuming the Strong Exponential Time Hypothesis (SETH), it proves a lower bound that rules out sub-quartic-time approximation when the attention input range $R$ satisfies $R=\Theta(\sqrt{\log n})$ with $d=O(\log n)$. It also shows that an almost-quadratic $O(n^{2+o(1)})$-time approximation is achievable when $R=o(\sqrt{\log n})$ via low-rank approximations, and provides explicit constructions that meet these criteria. The paper dissects VAR into three stages (token-map generation, feature-map reconstruction, and VQ-VAE decoding) and analyzes running times and error propagation for fast variants of the VAR Transformer and VQ-VAE Decoder. These results offer a theoretical framework for designing scalable autoregressive visual generators and delineate the fundamental limits of attention-based speedups in this domain, which may guide future hardware-aware optimizations and extensions to text-to-image and multi-modal synthesis.
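To make the phrase "via low-rank approximations" concrete, here is a minimal sketch of the general idea, not the paper's construction: when the inner products $q \cdot k$ are small, $\exp(q \cdot k)$ is well approximated by a truncated Taylor series, and each term $(q \cdot k)^j$ factors as an inner product of tensor powers. The function names, the toy matrices, and the omission of softmax normalization and $1/d$ scaling are all assumptions made for illustration; the naive tensor-power feature map below is also far less economical than what a real fast algorithm would use.

```python
import math
from itertools import product

def taylor_features(v, degree):
    """Hypothetical feature map: <phi(q), phi(k)> = sum_{j<=degree} (q.k)^j / j!."""
    feats = []
    for j in range(degree + 1):
        scale = 1.0 / math.sqrt(math.factorial(j))
        # Ordered index tuples of length j enumerate the j-fold tensor power of v.
        for idx in product(range(len(v)), repeat=j):
            feats.append(scale * math.prod(v[i] for i in idx))
    return feats

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def max_abs_error(Q, K, degree):
    """Largest entrywise gap between exp(q.k) and its low-rank surrogate."""
    PQ = [taylor_features(q, degree) for q in Q]
    PK = [taylor_features(k, degree) for k in K]
    return max(
        abs(math.exp(dot(q, k)) - dot(pq, pk))
        for q, pq in zip(Q, PQ)
        for k, pk in zip(K, PK)
    )

# Toy inputs (assumed): small entries versus the same matrices scaled by 4.
Q_small = [[0.5, -0.5], [0.25, 0.5]]
K_small = [[0.5, 0.5], [-0.5, 0.25]]
Q_large = [[4 * x for x in row] for row in Q_small]
K_large = [[4 * x for x in row] for row in K_small]

err_small = max_abs_error(Q_small, K_small, degree=4)
err_large = max_abs_error(Q_large, K_large, degree=4)
print(err_small, err_large)
```

With features of length $r$, the $n \times n$ matrix of exponentials is replaced by a product of an $n \times r$ and an $r \times n$ matrix, so it can be applied to the values without ever forming all $n^2$ entries. The sketch also shows why the input range matters: the same degree that is accurate for small entries fails badly once the entries grow.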

Abstract

Recently, Visual Autoregressive ($\mathsf{VAR}$) models introduced a groundbreaking advance in image generation, offering a scalable approach through a coarse-to-fine "next-scale prediction" paradigm. Let $n$ denote the height and width of the last VQ code map generated by a $\mathsf{VAR}$ model. The state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of $\mathsf{VAR}$ models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve almost quadratic time complexity. We prove that, assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the $\mathsf{VAR}$ model from a theoretical perspective. Our techniques shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.
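The quartic figure in the abstract can be sanity-checked with a small counting exercise. The sketch below assumes a geometric schedule of token-map side lengths (1, 2, 4, ..., $n$) and assumes that exact attention touches every pair of tokens in the concatenated sequence; both are simplifications for illustration, and the helper names are hypothetical.

```python
def scale_side_lengths(n, ratio=2):
    """Side lengths 1, ratio, ratio^2, ... capped at n (assumed geometric schedule)."""
    sides, s = [], 1
    while s < n:
        sides.append(s)
        s *= ratio
    sides.append(n)
    return sides

def total_tokens(n, ratio=2):
    """Number of tokens when all token maps up to side length n are concatenated."""
    return sum(s * s for s in scale_side_lengths(n, ratio))

def exact_attention_pairs(n, ratio=2):
    """Query-key pairs that exact attention over the full sequence must score."""
    length = total_tokens(n, ratio)
    return length * length

for n in (16, 32, 64):
    pairs = exact_attention_pairs(n)
    # The ratio to n^4 stays bounded, consistent with quartic growth in n.
    print(n, total_tokens(n), pairs, pairs / n**4)
```

Under these assumptions the sequence length grows like $n^2$ (the last map dominates the geometric sum), so the number of scored pairs grows like $n^4$, which matches the order of the $O(n^{4+o(1)})$ baseline quoted above.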

Paper Structure (45 sections, 25 theorems, 57 equations, 1 figure)


Introduction
Our Results
Roadmap.
Related Work
Visual Generation Models
AutoRegressive Models.
Diffusion Models.
Acceleration via Low-rank Approximation
Accelerating Attention Mechanisms.
Gradient Approximation.
Model Formulation
Notations
Overall Architecture
Stage 1: Token Maps Generation
Up Sample Blocks.
...and 30 more sections

Key Result

Theorem 1.1

Suppose $d = O(\log n)$ and $R = \Theta(\sqrt{\log n})$. Assuming $\mathsf{SETH}$, there is no algorithm that approximates the $\mathsf{VAR}$ model up to $1/\mathrm{poly}(n)$ additive error in $O(n^{4-\Omega(1)})$ time.
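One way to build intuition for why the size of $R$ matters is to look at how many Taylor terms are needed to approximate $\exp(x)$ on $[-B, B]$ to a fixed accuracy as the bound $B$ grows. The sketch below is only a numerical illustration under assumed names (`taylor_tail`, `min_degree`) and an assumed bound $B$ on the attention logits; it is not the reduction used to prove the theorem.

```python
def taylor_tail(B, degree, terms=200):
    """Numerical estimate of sum_{j > degree} B^j / j!, the worst-case
    truncation error of the degree-`degree` Taylor series of exp on [-B, B]."""
    term = 1.0
    for j in range(1, degree + 1):
        term *= B / j  # term now equals B^j / j!
    tail = 0.0
    for j in range(degree + 1, degree + 1 + terms):
        term *= B / j
        tail += term
    return tail

def min_degree(B, eps):
    """Smallest degree whose estimated truncation error is at most eps."""
    g = 0
    while taylor_tail(B, g) > eps:
        g += 1
    return g

for B in (0.5, 2.0, 8.0):
    print(B, min_degree(B, 1e-3))
```

The required degree rises with $B$, and the size of a polynomial feature map rises with the degree. This is the qualitative trade-off behind a threshold on $R$: small inputs admit compact low-rank surrogates, while larger inputs push the required rank up until the speedup disappears.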

Figures (1)

Figure 1: Example of the Pyramid Up-Interpolation Layer $\Phi_{{\rm up},2}$ used in the model.

Theorems & Definitions (56)

Theorem 1.1: Computational Limits of $\mathsf{VAR}$ Models (informal version of the formal lower-bound theorem)
Theorem 1.2: Existence of Almost Quadratic Time Algorithm (informal version of the formal upper-bound theorem)
Definition 3.1: Bicubic Spline Kernel
Definition 3.2: Up-interpolation Layer for One-Step Geometric Sequence
Definition 3.3: Pyramid Up-Interpolation Layer $\Phi_{\rm up}$
Definition 3.4: Single Attention Layer
Definition 3.5: $\mathsf{VAR}$ Transformer
Definition 3.6: Convolution Layer
Remark 3.7
Definition 4.2: Approximate Attention Computation $\mathsf{AAttC}(n, d, B, \delta)$ (Definition 1.2 in [as23])
...and 46 more
