LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization
Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, Yu Wang
TL;DR
This work targets practical deployment of Visual Autoregressive (VAR) models. It diagnoses redundancy along three axes (attention maps, attention outputs under classifier-free guidance, and data precision) and introduces three training-free compression techniques: Multi-Diagonal Windowed Attention (MDWA), Attention Sharing across CFG (ASC), and Mixed-Precision Quantization. Together, these methods deliver substantial efficiency gains (e.g., 85–90% attention computation savings, 50% memory reduction, 1.5x latency reduction) with negligible quality loss (FID increase < 0.056), and they demonstrate that VAR can be deployed on resource-constrained hardware. The results point to a scalable path toward efficient AR-based image generation, especially at higher resolutions, where attention cost grows quadratically with the number of tokens. The approach combines algorithmic design with deployment-aware quantization, offering practical guidance for compressing AR-based visual generation models without retraining.
Abstract
Visual Autoregressive (VAR) modeling has emerged as a promising approach to image generation, offering performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we analyze the VAR model and identify significant redundancy in three dimensions: (1) the attention map, (2) the attention outputs when using classifier-free guidance, and (3) the data precision. Correspondingly, we propose efficient attention mechanisms and a low-bit quantization method to improve the efficiency of VAR models while maintaining performance. With negligible performance loss (less than 0.056 FID increase), we achieve an 85.2% reduction in attention computation, a 50% reduction in overall memory, and a 1.5x latency reduction. All proposed compression techniques are training-free, and we analyze the deployment feasibility and efficiency gain of each.
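To make the data-precision axis concrete, the following is a minimal sketch of training-free mixed-precision weight quantization. It is not the authors' implementation: the abstract does not specify the quantizer or the bit-width assignment, so the symmetric per-tensor uniform quantizer, the layer names (attn_qkv, mlp_fc1, head), the layer sizes, and the 8/4-bit policy below are all hypothetical choices made for illustration only.

```python
import random


def quantize_symmetric(values, bits):
    """Fake-quantize floats to signed `bits`-bit integer codes and map them back."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per tensor
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]


def mean_abs_error(reference, approx):
    """Average absolute difference between original and quantized values."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


def storage_bytes(layers, bits_per_layer):
    """Weight storage in bytes for a given per-layer bit-width assignment."""
    total_bits = sum(len(w) * bits_per_layer[name] for name, w in layers.items())
    return total_bits / 8


random.seed(0)

# Hypothetical layers with random weights (assumption: not taken from the paper).
layers = {
    "attn_qkv": [random.gauss(0.0, 1.0) for _ in range(4096)],
    "mlp_fc1": [random.gauss(0.0, 1.0) for _ in range(4096)],
    "head": [random.gauss(0.0, 1.0) for _ in range(1024)],
}

# Hypothetical policy: layers assumed to be sensitive keep 8 bits, the rest use 4.
policy = {"attn_qkv": 8, "mlp_fc1": 4, "head": 8}
baseline = {name: 16 for name in layers}            # 16-bit reference storage

baseline_bytes = storage_bytes(layers, baseline)
mixed_bytes = storage_bytes(layers, policy)
errors = {
    name: mean_abs_error(w, quantize_symmetric(w, policy[name]))
    for name, w in layers.items()
}

print(f"baseline: {baseline_bytes:.0f} B, mixed precision: {mixed_bytes:.0f} B")
for name, err in errors.items():
    print(f"{name}: {policy[name]}-bit, mean abs error {err:.4f}")
```

The sketch only illustrates the trade-off that a mixed-precision policy navigates: lower bit-widths shrink storage but raise quantization error, so more sensitive layers are given more bits. The numbers it prints are properties of the toy weights and say nothing about the results reported in the paper.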
