
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng

TL;DR

SparVAR addresses the high latency of visual autoregressive (VAR) models, which arises from dense attention across many scales, by exploiting cross-scale sparsity. It introduces CS^4A, which predicts the sparse attention patterns of high-resolution scales from a mid-resolution decision scale, and Cross-Scale Local Sparse Attention (CSLA), which enforces locality through a block-wise sparse kernel. Together they give training-free acceleration that preserves high-frequency details. Experiments on 1024×1024 generation with Infinity-8B show up to 1.57× faster inference without skipping scales and up to 2.28× when combined with scale-skipping strategies, with GenEval and low-level metrics indicating comparable or better fidelity. The approach generalizes across VAR models (including HART) and offers a practical path to deploying high-resolution VAR generation with near-real-time responsiveness. The code is open source.

Abstract

Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next-scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior acceleration methods often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves $\mathbf{> 5\times}$ faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparVAR can reduce the generation time of an 8B model producing $1024\times1024$ high-resolution images to the one-second range, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a $\mathbf{2.28\times}$ acceleration while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
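
To make the quartic growth concrete, the sketch below counts dense query-key pairs per autoregressive step. It assumes a 13-step schedule of token-map side lengths (1, 2, 4, 6, 8, 12, 16, 20, 24, 32, 40, 48, 64). That schedule is not stated in this text; it was chosen because its total token count matches the $kv_{len}=10521$ reported for the last scale in Figure 4. The function name is illustrative, not taken from the released code.

```python
# Assumed scale schedule (side length of each token map). Hypothetical:
# picked so that the total token count equals 10521.
SCALES = (1, 2, 4, 6, 8, 12, 16, 20, 24, 32, 40, 48, 64)


def dense_attention_pairs(scales):
    """Return (q_len, kv_len, q_len * kv_len) for each autoregressive step.

    At every step the queries are the tokens of the current scale, and the
    keys/values are all tokens from the first scale up to the current one.
    """
    rows = []
    kv_len = 0
    for side in scales:
        q_len = side * side
        kv_len += q_len          # the KV cache grows with every scale
        rows.append((q_len, kv_len, q_len * kv_len))
    return rows


rows = dense_attention_pairs(SCALES)
total_pairs = sum(pairs for _, _, pairs in rows)
last_q, last_kv, last_pairs = rows[-1]

# Side length 32 is at index 9 and side length 64 at index 12.
growth_32_to_64 = rows[12][2] / rows[9][2]

print(f"last scale: q_len={last_q}, kv_len={last_kv}")
print(f"pair growth when the side doubles from 32 to 64: {growth_32_to_64:.1f}x")
print(f"share of all pairs spent on the last scale: {last_pairs / total_pairs:.1%}")
```

Under this assumed schedule, doubling the side length from 32 to 64 multiplies the pair count by roughly 16, and the final scale alone accounts for more than half of all query-key pairs, which is why sparsifying the last scales matters most.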

Paper Structure

This paper contains 37 sections, 27 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Our SparVAR achieves efficient acceleration while preserving high-frequency details consistent with the Infinity baseline, whereas prior methods introduce visible artifacts and texture loss. The metrics at the bottom denote GenEval / PSNR. Zoom in for fine-detail comparison.
  • Figure 2: Visualization of attention activation patterns in Infinity across different layers and heads. (a) Strong Attention Sinks: early-scale tokens consistently attract large attention weights, serving as global anchors that dominate image structure formation. (b) Cross-Scale Activation Similarity: corresponding sub-blocks across adjacent scales exhibit similar activation distributions, indicating redundant attention patterns that can be transferred across scales. (c) Pronounced Spatial Locality: at higher scales, attention becomes increasingly concentrated along local spatial bands, revealing strong locality both within and between neighboring scales.
  • Figure 3: The manifestation of attention sinks in the KV cache.
  • Figure 4: Visualization of the Cross-Scale Local Sparse Attention (CSLA) masks. This example shows the last-scale attention map in Infinity ($q_{len}=4096$, $kv_{len}=10521$). The left shows the token-wise sparse mask, and the right shows the corresponding block-wise version after applying the block aggregation in Eq. \ref{eq:block_mask}. Red dashed grids denote $128\times 128$ blocks, while yellow dashed lines mark the attention partition boundaries described in Eq. \ref{eq:attn_block}. Zoom in for fine-detail visualization.
  • Figure 5: The impact of selecting sparse decision scales on generation results. Zoom in for fine-detail comparison.
  • ...and 7 more figures
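
The token-wise to block-wise conversion described for Figure 4 can be sketched as follows. The aggregation rule used here (keep a block if any query-key pair inside it is kept) is an assumption, since the referenced block-mask equation is not included in this text. The toy mask (a few sink columns plus a diagonal local band) and both function names are likewise illustrative and are not the paper's implementation.

```python
def block_aggregate(token_mask, block):
    """Collapse a token-wise boolean mask into a block-wise one.

    Assumed rule: a (block x block) tile is kept if any entry inside it is
    True. Ragged edge tiles are allowed when a length is not a multiple of
    `block`.
    """
    q_len = len(token_mask)
    kv_len = len(token_mask[0])
    n_q = -(-q_len // block)     # ceiling division
    n_kv = -(-kv_len // block)
    block_mask = [[False] * n_kv for _ in range(n_q)]
    for i, row in enumerate(token_mask):
        for j, keep in enumerate(row):
            if keep:
                block_mask[i // block][j // block] = True
    return block_mask


def toy_mask(q_len, kv_len, sink, window):
    """Toy sparse mask: the first `sink` keys plus a band of half-width `window`.

    Queries are aligned with the last `q_len` keys, mimicking the current
    scale sitting at the end of the KV cache.
    """
    offset = kv_len - q_len
    return [
        [j < sink or abs(j - (i + offset)) <= window for j in range(kv_len)]
        for i in range(q_len)
    ]


mask = toy_mask(q_len=16, kv_len=24, sink=2, window=1)
blocks = block_aggregate(mask, block=4)
for row in blocks:
    print("".join("#" if kept else "." for kept in row))
```

With the sizes quoted in Figure 4 ($q_{len}=4096$, $kv_{len}=10521$) and $128\times 128$ blocks, the same ceiling division would give 32 block rows and 83 block columns, with the last column only partially filled. A block-sparse kernel can then skip every tile whose block-mask entry is false.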