LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang

TL;DR

This work targets practical deployment of Visual Autoregressive (VAR) models. It diagnoses redundancy along three axes (attention maps, classifier-free guidance outputs, and data precision) and introduces three training-free compression techniques: Multi-Diagonal Windowed Attention (MDWA), Attention Sharing across CFG (ASC), and Mixed-Precision Quantization. Together, these methods deliver 85–90% savings in attention computation, a 50% memory reduction, and a 1.5x latency reduction with negligible quality loss (FID increase < 0.056), and they demonstrate the feasibility of deploying VAR on resource-constrained hardware. The gains matter most at higher resolutions, where attention cost grows quadratically with the number of tokens. Because none of the techniques requires retraining, the approach offers practical guidance for compressing AR-based visual generation models.
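As a rough illustration of the windowed-attention idea, the sketch below builds a boolean attention mask over a multi-scale token sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `build_mdwa_mask` and `kept_fraction`, the choice to compare token positions in normalized image coordinates, the block-causal rule across scales, and the window sizes are all hypothetical choices made for this example.

```python
def build_mdwa_mask(scales, windows):
    """Return mask[q][k]; True means query token q may attend to key token k.

    scales  : side lengths of the token grids, coarse to fine (assumed layout).
    windows : one window radius per query scale, in normalized image
              coordinates (assumed parameterization, not from the paper).
    """
    # Flatten every side x side grid, row-major, into one token sequence.
    tokens = []
    for idx, side in enumerate(scales):
        for r in range(side):
            for c in range(side):
                tokens.append((idx, r, c, side))

    mask = []
    for (qi, qr, qc, qs) in tokens:
        row = []
        for (ki, kr, kc, ks) in tokens:
            if ki > qi:
                # Assumed block-causal rule: never attend to finer scales.
                row.append(False)
                continue
            # Distance between token centers in normalized [0, 1) coordinates.
            dy = abs((qr + 0.5) / qs - (kr + 0.5) / ks)
            dx = abs((qc + 0.5) / qs - (kc + 0.5) / ks)
            # Keep only keys inside this query scale's own window.
            row.append(max(dy, dx) <= windows[qi])
        mask.append(row)
    return mask


def kept_fraction(mask, scales):
    """Fraction of block-causal attention entries that the mask keeps."""
    allowed = 0
    seen = 0
    for side in scales:
        n = side * side
        seen += n
        allowed += n * seen  # each token of this scale sees all tokens so far
    kept = sum(sum(1 for v in row if v) for row in mask)
    return kept / allowed


# Toy configuration: three scales, with a narrow window only at the finest one.
toy_scales = [1, 2, 4]
toy_mask = build_mdwa_mask(toy_scales, [1.0, 1.0, 0.3])
toy_fraction = kept_fraction(toy_mask, toy_scales)
```

Because the grids are flattened row-major, a local 2D window shows up as several parallel diagonals in the flattened attention map, which is why a per-scale window is a natural fit for a multi-diagonal pattern. The window values here are illustrative only and say nothing about the savings reported above.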

Abstract

Visual Autoregressive (VAR) modelling has emerged as a promising approach to image generation, with performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we analyze the VAR model and identify significant redundancy in three dimensions: (1) the attention map, (2) the attention outputs when using classifier-free guidance, and (3) the data precision. Correspondingly, we propose an efficient attention mechanism and a low-bit quantization method that improve the efficiency of VAR models while maintaining performance. With negligible performance loss (less than 0.056 FID increase), we achieve an 85.2% reduction in attention computation, a 50% reduction in overall memory, and a 1.5x latency reduction. To ensure deployment feasibility, we develop efficient training-free compression techniques and analyze the deployment feasibility and efficiency gain of each technique.
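The second redundancy, in attention outputs under classifier-free guidance, can be illustrated with a small sketch. With guidance, each step runs a conditional and an unconditional branch; if their attention outputs are nearly identical, one can be computed and reused for the other. The code below is a hedged illustration only: the helper names, the toy attention stand-in, and the decision to reuse the conditional output for the unconditional branch are assumptions, not details confirmed by the text above. The guidance formula is the standard one, `uncond + scale * (cond - uncond)`.

```python
def make_counting_attention():
    """Toy stand-in for an attention layer that records how often it runs."""
    calls = {"n": 0}

    def attn(x):
        calls["n"] += 1
        mean = sum(x) / len(x)
        return [mean for _ in x]  # placeholder computation, not real attention

    return attn, calls


def cfg_attention(cond_x, uncond_x, attn_fn, share):
    """Run attention for both guidance branches, optionally sharing the result."""
    cond_out = attn_fn(cond_x)
    # Assumed direction of sharing: the unconditional branch reuses the
    # conditional branch's attention output instead of recomputing it.
    uncond_out = cond_out if share else attn_fn(uncond_x)
    return cond_out, uncond_out


def cfg_combine(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance combination of the two branches."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Count attention calls with and without sharing.
attn_shared, shared_calls = make_counting_attention()
cfg_attention([1.0, 2.0, 3.0], [1.1, 2.1, 2.9], attn_shared, share=True)

attn_full, full_calls = make_counting_attention()
cfg_attention([1.0, 2.0, 3.0], [1.1, 2.1, 2.9], attn_full, share=False)
```

In this toy setup sharing halves the number of attention calls per layer; how much of that carries over to end-to-end latency depends on the rest of the model and is not something this sketch measures.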

Paper Structure

This paper contains 8 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Three dimensions of redundancy and the corresponding compression techniques. Redundancy exists at the attention-map level, the classifier-free guidance level, and the data-precision level. We design multi-diagonal windowed attention, CFG-wise sharing, and mixed-precision quantization to address each of them.
  • Figure 2: Attention map characteristics. (a) Multi-diagonal concentration. The VAR model's attention values are concentrated on multiple diagonals, with each diagonal exhibiting a distinct shape across scales. Consequently, we design a separate window attention mechanism for each scale, which we refer to as Multi-Diagonal Window Attention (MDWA). (b) Similarity of attention outputs between conditional and unconditional generation.
  • Figure 3: Comparison of original image generation with generation using Multi-Diagonal Window Attention (MDWA) and CFG-wise attention sharing (ASC).
  • Figure 4: Comparison of the original image, the quantized image, and the quantized image with protection of sensitive layers. Top row: naively quantized images exhibit substantial blurring or loss of legible content. Bottom row: image quality after quantization improves significantly.
  • Figure 5: Comparison of the impact on image quality of all seven types of linear layers: "word_embed", "attn.mat_qkv", "attn.proj", "ffn.fc1", "ffn.fc2", "ada_lin.1", and "head".
  • ...and 3 more figures
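
The captions for Figures 4 and 5 describe quantizing the model while protecting sensitive layer types. A minimal sketch of that idea follows, assuming symmetric uniform quantization and a per-layer-type bit-width table. The layer names come from the Figure 5 caption; everything else is hypothetical, including the function names, the bit widths, and the choice of which layer type to treat as sensitive (the captions listed here do not say which one it is).

```python
LAYER_TYPES = [
    "word_embed", "attn.mat_qkv", "attn.proj",
    "ffn.fc1", "ffn.fc2", "ada_lin.1", "head",
]


def fake_quantize(weights, bits):
    """Symmetric uniform quantization: round to signed integers, map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)  # nothing to scale
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the representable range
        out.append(q * scale)
    return out


def assign_bits(layer_types, sensitive, low_bits, high_bits):
    """Give sensitive layer types a higher bit width than the rest."""
    return {
        name: (high_bits if name in sensitive else low_bits)
        for name in layer_types
    }


def max_error(original, quantized):
    """Largest absolute difference introduced by quantization."""
    return max(abs(a - b) for a, b in zip(original, quantized))


# Hypothetical configuration: one placeholder "sensitive" layer type, 4 vs 8 bits.
bit_plan = assign_bits(LAYER_TYPES, {"ffn.fc2"}, low_bits=4, high_bits=8)

toy_weights = [1.0, -0.5, 0.26, 0.03]
err_low = max_error(toy_weights, fake_quantize(toy_weights, 4))
err_high = max_error(toy_weights, fake_quantize(toy_weights, 8))
```

On the toy weights the 8-bit version has a smaller worst-case error than the 4-bit one, which is the motivation for spending more bits only where image quality is most affected. In practice the sensitive set would be chosen from a per-layer-type comparison such as the one Figure 5 reports.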