StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang, Yaxing Wang

TL;DR

The paper tackles the computational inefficiency of Visual Autoregressive (VAR) image generation by dividing inference into three stages: semantic establishment, structure establishment, and fidelity refinement. It introduces StageVAR, a plug-and-play, training-free acceleration framework that leaves the two early stages intact and accelerates the fidelity refinement stage by exploiting semantic irrelevance and low-rank feature properties. Experiments on large VAR models show speedups of up to 3.4× with only small drops on GenEval (0.01) and DPG (0.26), while outperforming existing acceleration baselines. The work highlights a general principle: safeguard early semantic and structural content and aggressively optimize the later refinement steps, offering a path toward faster, high-fidelity VAR-based image synthesis.
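To make the stage-aware idea concrete, here is a minimal sketch of why the late steps are the natural target for acceleration. It is an illustration under stated assumptions, not the paper's implementation: the scale schedule `SCALES`, the boundary `FIRST_LATE_STEP`, and the helper names are all hypothetical, and token count is used only as a rough proxy for per-step cost.

```python
# Illustrative sketch only: schedule and stage boundary are assumed values,
# not taken from the StageVAR paper.

def token_counts(side_lengths):
    """Tokens per scale step, assuming a square token map at each step."""
    return [s * s for s in side_lengths]


def late_fraction(side_lengths, first_late_step):
    """Share of all tokens generated at or after `first_late_step`."""
    counts = token_counts(side_lengths)
    return sum(counts[first_late_step:]) / sum(counts)


def stage_plan(num_steps, first_late_step):
    """Run early steps unchanged; mark later steps as candidates for acceleration."""
    return ["full" if k < first_late_step else "accelerated"
            for k in range(num_steps)]


# Hypothetical next-scale schedule (side length of the token map per step).
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
# Hypothetical boundary between the early stages and fidelity refinement.
FIRST_LATE_STEP = 7

plan = stage_plan(len(SCALES), FIRST_LATE_STEP)
share = late_fraction(SCALES, FIRST_LATE_STEP)
print(plan)
print(f"late steps produce {share:.1%} of all tokens")
```

With this assumed schedule, the last three steps account for roughly 77% of all generated tokens, so approximating only those steps leaves the early semantic and structural steps untouched while still addressing most of the work.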

Abstract

Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4× speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.
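The abstract does not spell out how the low-rank property is exploited, so the following is only a generic illustration of what "low-rank features" means, not StageVAR's procedure. It builds a synthetic feature matrix (a stand-in for a tokens × channels feature map; the sizes and the helper names `low_rank_approx` and `relative_error` are assumptions) and shows that a truncated SVD with a small rank reconstructs it almost exactly.

```python
import numpy as np


def low_rank_approx(features, rank):
    """Best rank-`rank` approximation in Frobenius norm via truncated SVD."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    # Keep only the top `rank` singular triplets.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def relative_error(features, approx):
    """Frobenius-norm reconstruction error relative to the original matrix."""
    return np.linalg.norm(features - approx) / np.linalg.norm(features)


rng = np.random.default_rng(0)

# Synthetic stand-in for a late-step feature map (assumed sizes):
# approximately rank 8, plus a small amount of noise.
tokens, channels, true_rank = 256, 64, 8
features = (rng.standard_normal((tokens, true_rank))
            @ rng.standard_normal((true_rank, channels)))
features += 0.01 * rng.standard_normal((tokens, channels))

approx = low_rank_approx(features, rank=true_rank)
err = relative_error(features, approx)
print(f"relative error at rank {true_rank}: {err:.4f}")
```

When features are close to low-rank, as in this synthetic case, a much smaller representation carries nearly all of the information, which is the kind of redundancy an approximation of late-stage computation can rely on.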

Paper Structure

This paper contains 26 sections, 8 equations, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: (a) Visualization of semantic evolution across all scale steps (i.e., CLIP and DINO). (b) Visualization of structure evolution on all scale steps (i.e., LPIPS and DISTS). (c) Variations of the next scale step in the frequency domain. (Bottom) Visualization of samples across all scale steps.
  • Figure 2: (Left) Semantic and perceptual quality as a function of the scale step from which CFG is set to 0. (Right) Sample visualizations obtained by setting CFG to 0 at large-scale steps.
  • Figure 3: Visualization of VAR inference across ① vanilla, ② the low-rank feature, and ③/④ the $r$-dimensional feature.
  • Figure 4: Overview of the proposed StageVAR framework. We retain the original VAR inference process for the semantic and structure establishment stages, while exploiting semantic irrelevance and low-rank properties in the fidelity refinement stage to accelerate inference.
  • Figure 5: Qualitative comparison with the vanilla Infinity-2B, Infinity-8B, and STAR models (1st, 3rd, and 5th rows). Our StageVAR (2nd, 4th, and 6th rows) achieves a $3.4\times$, $2.7\times$, and $1.74\times$ speedup while maintaining performance.
  • ...and 9 more figures