Diversity Has Always Been There in Your Visual Autoregressive Models

Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li

TL;DR

VAR models offer efficient, high-fidelity image generation via next-scale prediction but suffer from diversity collapse. The paper introduces DiverseVAR, a training-free framework that exploits early-scale dynamics: it suppresses the pivotal component in the model input with Soft-Suppression Regularization (SSR) and amplifies it in the model output with Soft-Amplification Regularization (SAR), using SVD to identify the dominant information. By applying SSR and SAR at early scales (notably scales 4 and 6) across all blocks, DiverseVAR significantly improves diversity (Recall, Coverage, FID) while maintaining text–image alignment and image quality on benchmarks such as GenEval, DPG, COCO, AFHQ, and CelebA-HQ. The work demonstrates that diversity is latent in the VAR architecture and can be unlocked without retraining, offering a practical enhancement for fast, diverse visual synthesis and shedding light on the role of early-scale structure in multimodal generation.

Abstract

Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals that the pivotal component of the feature map is a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible impact on performance. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.
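
To make the suppress-then-amplify idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and the exact SSR and SAR formulas are not given in this summary. The sketch assumes that the "pivotal component" corresponds to the leading singular direction of a (tokens × channels) feature matrix, and that "soft" means rescaling its singular value rather than zeroing it. The function `rescale_dominant` and the parameters `factor` and `k` are hypothetical names chosen for illustration.

```python
import numpy as np


def rescale_dominant(features, factor, k=1):
    """Scale the top-k singular values of a (tokens, channels) matrix.

    Hypothetical helper: factor < 1 softly suppresses the dominant
    component, factor > 1 softly amplifies it. The remaining singular
    values and all singular vectors are left untouched.
    """
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    s = s.copy()
    s[:k] *= factor
    # (u * s) scales each column of u by the matching singular value.
    return (u * s) @ vt


# Toy feature map (assumed shape): one strong rank-1 component plus noise.
rng = np.random.default_rng(0)
tokens, channels = 16, 8
dominant = 5.0 * np.outer(rng.normal(size=tokens), rng.normal(size=channels))
features = dominant + 0.1 * rng.normal(size=(tokens, channels))

# Assumed usage: suppress on the input side, amplify on the output side.
suppressed = rescale_dominant(features, factor=0.5)
amplified = rescale_dominant(features, factor=1.5)

sv_orig = np.linalg.svd(features, compute_uv=False)
sv_sup = np.linalg.svd(suppressed, compute_uv=False)
sv_amp = np.linalg.svd(amplified, compute_uv=False)
print("leading singular value ratios:",
      sv_sup[0] / sv_orig[0], sv_amp[0] / sv_orig[0])
```

In this toy setting only the leading singular value changes, so the weaker components that carry the remaining variation are preserved. The factors 0.5 and 1.5 are arbitrary illustration values, not settings reported by the paper.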

Paper Structure

This paper contains 26 sections, 7 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Multiple generation samples from the vanilla VAR models (1st and 3rd rows) and our DiverseVAR (2nd and 4th rows). While vanilla VAR models suffer from diversity collapse, our method generates more diverse outputs while maintaining image–text alignment. The text prompts used are as follows: "A man in a clown mask eating a donut", "A cat wearing a Halloween costume", "Golden Gate Bridge at sunset, glowing sky, ...", "A palace under the sunset", "A cool astronaut floating in space", and "A cat riding a skateboard down a hill".
  • Figure 2: Visualization of samples across all scales (1st row) and their associated DINO features (2nd row).
  • Figure 3: (Left) Statistics of structure evolution on all scale steps. (Right) The relative log amplitude of frequency components across different scales.
  • Figure 4: Visualization of samples when zeroing out the pivotal (1st row) or auxiliary (2nd row) tokens across all scales except the 1st scale (1st–12th columns), along with the vanilla generation results (last column).
  • Figure 5: Structural (Left) and semantic (Right) evaluation when pivotal and auxiliary tokens are zeroed out.
  • ...and 8 more figures