Diversity Has Always Been There in Your Visual Autoregressive Models
Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li
TL;DR
VAR models offer efficient, high-fidelity image generation via next-scale prediction but suffer from diversity collapse. The paper introduces DiverseVAR, a training-free framework that exploits early-scale dynamics: it uses SVD to identify the dominant (pivotal) component of the feature map, suppresses that component in the model input with Soft-Suppression Regularization (SSR), and amplifies it in the model output with Soft-Amplification Regularization (SAR). By applying SSR and SAR at early scales (notably scales 4 and 6) across all blocks, DiverseVAR significantly improves diversity-related metrics (Recall, Coverage, FID) while maintaining text–image alignment and image quality on benchmarks such as GenEval, DPG, COCO, AFHQ, and CelebA-HQ. The work demonstrates that diversity is latent in the VAR architecture and can be unlocked without retraining, offering a practical, scalable enhancement for fast, diverse visual synthesis and shedding light on the role of early-scale structure in multimodal generation.
Abstract
Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals that the pivotal component of the feature map is a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible impact on performance. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.
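To make the suppress-then-amplify idea concrete, the following is a minimal sketch and not the released DiverseVAR code. It assumes the pivotal component can be approximated by the top singular direction of a (tokens x channels) feature map, and it uses a plain multiplicative rescaling of the top singular value as a stand-in for the soft regularizers, whose exact form is not given here. The function name `rescale_pivotal_component` and the scale factors 0.5 and 1.5 are hypothetical.

```python
import numpy as np


def rescale_pivotal_component(features, scale, k=1):
    """Rescale the top-k singular values of a (tokens x channels) feature map.

    Assumption: the "pivotal component" is approximated by the top-k singular
    directions; scale < 1 suppresses it, scale > 1 amplifies it.
    """
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    s = s.copy()
    s[:k] *= scale  # only the dominant component(s) are modified
    return (u * s) @ vt  # rebuild the feature map from the edited spectrum


# Toy feature map standing in for an early-scale VAR feature (hypothetical sizes).
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8))

suppressed = rescale_pivotal_component(features, scale=0.5)  # input side
amplified = rescale_pivotal_component(features, scale=1.5)   # output side
```

In this sketch only the dominant singular value changes; the remaining spectrum, and with it most of the fine-grained content of the feature map, is left untouched.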
