On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

TL;DR

This work studies the computational limits of Visual AutoRegressive (VAR) models for image generation through a fine-grained complexity lens. Assuming the Strong Exponential Time Hypothesis (SETH), it proves a lower bound: when $d=O(\log n)$ and the attention input range $R$ satisfies $R=\Theta(\sqrt{\log n})$, no algorithm can approximate the VAR model in sub-quartic time, i.e. $O(n^{4-\Omega(1)})$. Conversely, when $R=o(\sqrt{\log n})$, low-rank approximation yields an almost-quadratic $O(n^{2+o(1)})$-time approximation algorithm, and the paper provides explicit constructions that meet this criterion. The paper dissects VAR into three stages (token-map generation, feature-map reconstruction, and VQ-VAE decoding) and analyzes the running time and error propagation of fast variants of the VAR Transformer and the VQ-VAE Decoder, establishing provable efficiency under practical assumptions. These results offer a theoretical framework for designing scalable autoregressive visual generators and delineate the fundamental limits of attention-based speedups in this domain, which may guide future hardware-aware optimizations and cross-domain extensions to text-to-image and multi-modal synthesis.
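To make the low-rank idea concrete, here is a minimal sketch, not the paper's construction. It assumes the attention matrix has the form $A=\exp(QK^\top/d)$ with row normalization, and it swaps in a plain degree-2 Taylor feature map in place of the paper's polynomial approximation. The names `taylor_features` and `low_rank_attention` are hypothetical. The point it illustrates is that once $A \approx U_1 U_2^\top$ with thin factors, the output can be computed without ever forming the full attention matrix, and that the approximation is accurate only when the entries of $Q$ and $K$ are small.

```python
import numpy as np

def taylor_features(X: np.ndarray) -> np.ndarray:
    """Hypothetical degree-2 feature map: phi(x) . phi(y) = 1 + x.y + (x.y)^2 / 2."""
    n, d = X.shape
    ones = np.ones((n, 1))
    # Flattened outer products x_i * x_j, scaled so the inner product gives (x.y)^2 / 2.
    quad = np.einsum("ni,nj->nij", X, X).reshape(n, d * d) / np.sqrt(2.0)
    return np.hstack([ones, X, quad])

def exact_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Reference computation that materializes the full attention matrix."""
    d = Q.shape[1]
    A = np.exp(Q @ K.T / d)
    return (A @ V) / A.sum(axis=1, keepdims=True)

def low_rank_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Approximate attention through thin factors U1, U2 with U1 @ U2.T ~ exp(Q K^T / d)."""
    d = Q.shape[1]
    U1 = taylor_features(Q / np.sqrt(d))
    U2 = taylor_features(K / np.sqrt(d))
    numer = U1 @ (U2.T @ V)       # never forms the (rows x rows) matrix
    denom = U1 @ U2.sum(axis=0)   # approximate row sums of A
    return numer / denom[:, None]

rng = np.random.default_rng(0)
rows, d, R = 200, 4, 0.3          # R bounds the entries of Q and K (assumed small)
Q = rng.uniform(-R, R, size=(rows, d))
K = rng.uniform(-R, R, size=(rows, d))
V = rng.uniform(-1.0, 1.0, size=(rows, d))

max_err = np.max(np.abs(exact_attention(Q, K, V) - low_rank_attention(Q, K, V)))
print("max abs error:", max_err)
```

With bounded entries the truncated Taylor series is accurate, which mirrors the role of the bound $R$ in the efficiency criterion; for large entries a fixed low-degree approximation of this kind degrades.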

Abstract

Recently, Visual Autoregressive ($\mathsf{VAR}$) models introduced a groundbreaking advance in image generation, offering a scalable approach through a coarse-to-fine "next-scale prediction" paradigm. Let $n$ denote the height and width of the last VQ code map generated by a $\mathsf{VAR}$ model. The state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of $\mathsf{VAR}$ models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve almost quadratic time complexity. We prove that, assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible when the attention input range satisfies $R=\Theta(\sqrt{\log n})$. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the $\mathsf{VAR}$ model from a theoretical perspective. Our techniques shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.
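The quartic baseline can be made plausible with a rough count. The sketch below assumes a halving scale schedule (side lengths $n, n/2, \dots, 1$), which is an illustrative assumption rather than the exact schedule of any particular model, and it ignores the hidden dimension $d$. Under that assumption the token maps of all scales together hold $\Theta(n^2)$ tokens, so attention over all token pairs costs on the order of $n^4$. The function names are hypothetical.

```python
def pyramid_token_count(n: int) -> int:
    """Total tokens over square maps of side n, n/2, n/4, ..., 1 (assumed halving schedule)."""
    total = 0
    side = n
    while side >= 1:
        total += side * side
        side //= 2
    return total

def naive_attention_pairs(n: int) -> int:
    """Number of query-key pairs if every token attends to every token."""
    length = pyramid_token_count(n)
    return length * length

for n in (16, 32, 64, 128):
    length = pyramid_token_count(n)
    # The ratio to n^4 stays bounded, consistent with quartic growth in n.
    print(n, length, naive_attention_pairs(n) / n**4)
```

For powers of two the total token count stays below $\tfrac{4}{3}n^2$, so the pair count stays within a constant factor of $n^4$.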

Paper Structure

This paper contains 45 sections, 25 theorems, 57 equations, and 1 figure.

Key Result

Theorem 1.1

Suppose $d = O(\log n)$ and $R = \Theta(\sqrt{\log n})$. Assuming $\mathsf{SETH}$, there is no algorithm that approximates the $\mathsf{VAR}$ model up to $1/\mathrm{poly}(n)$ additive error in $O(n^{4-\Omega(1)})$ time.

Figures (1)

  • Figure 1: Example of the Pyramid Up-Interpolation Layer $\Phi_{{\rm up},2}$ used in the model.

Theorems & Definitions (56)

  • Theorem 1.1: Computational Limits of $\mathsf{VAR}$ Models, informal version of Theorem \ref{thm:lower_bound:formal}
  • Theorem 1.2: Existence of Almost Quadratic Time Algorithm, informal version of Theorem \ref{thm:upper_bound:formal}
  • Definition 3.1: Bicubic Spline Kernel
  • Definition 3.2: Up-interpolation Layer for One-Step Geometric Sequence
  • Definition 3.3: Pyramid Up-Interpolation Layer $\Phi_{\rm up}$
  • Definition 3.4: Single Attention Layer
  • Definition 3.5: $\mathsf{VAR}$ transformer
  • Definition 3.6: Convolution Layer
  • Remark 3.7
  • Definition 4.2: Approximate Attention Computation $\mathsf{AAttC}(n, d, B, \delta)$, Definition 1.2 in [AS23]
  • ...and 46 more