DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung

TL;DR

DiverseVAR addresses the lack of per-prompt diversity in text-conditioned visual autoregressive (VAR) models with a training-free, two-stage approach. First, it applies diffusion-inspired diversity techniques; of these, injecting noise into the text embedding (condition-annealing) proves most effective for VAR models, but it degrades image quality. To recover quality while preserving diversity, it introduces scale-travel, a VAR-specific latent refinement that reverts to coarser scales via multi-scale encoding and resumes generation, mitigating the artifacts introduced by the noise. Across the Infinity and Switti VAR models on MS-COCO and MJHQ-30K, DiverseVAR sets a new Pareto frontier for the diversity–quality trade-off, outperforming CFG-scheduling and CADS baselines with modest inference overhead. The work shows that test-time refinements aligned with the multi-scale structure of VAR models can substantially improve diversity without retraining, which suggests a promising direction for controllable image synthesis with autoregressive models.

Abstract

We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.
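
The first stage, noise injection into the text embedding, can be illustrated with a small NumPy sketch. The abstract states only that noise is injected into the text embedding; the additive Gaussian form, the linear annealing schedule over scales, and the names `anneal_weight`, `noisy_text_embedding`, `noise_scale`, and `cutoff` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def anneal_weight(scale_idx, num_scales, cutoff=0.5):
    """Hypothetical linear schedule (an assumption, not from the paper).

    Returns 1.0 at the coarsest scale and decays linearly to 0.0 once the
    fraction of completed scales reaches `cutoff`.
    """
    progress = scale_idx / max(num_scales - 1, 1)
    return float(np.clip(1.0 - progress / cutoff, 0.0, 1.0))


def noisy_text_embedding(text_emb, scale_idx, num_scales,
                         noise_scale=0.1, rng=None):
    """Add annealed Gaussian noise to a text embedding (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    weight = anneal_weight(scale_idx, num_scales)
    noise = rng.standard_normal(text_emb.shape)
    return text_emb + noise_scale * weight * noise


# Usage: perturb a stand-in embedding at each of 10 generation scales.
rng = np.random.default_rng(0)
text_emb = np.ones((4, 8))  # placeholder for a real text-encoder output
perturbed = [noisy_text_embedding(text_emb, k, 10, rng=rng) for k in range(10)]
```

Under this assumed schedule the coarse scales, which fix global layout, receive the most noise, and the fine scales receive none. A larger `noise_scale` would increase variation between samples and, per the trade-off described above, would be expected to lower image quality.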

Paper Structure

This paper contains 39 sections, 6 equations, 20 figures, 10 tables, 2 algorithms.

Figures (20)

  • Figure 1: Diversity-Enhancement Techniques. We explore two main options for diversity enhancement in VAR: (a) CFG-scheduling and (b) condition-annealing. CFG-scheduling modulates the CFG scale over sampling steps to mitigate mode collapse. Condition-annealing injects noise into the text-embedding or the <SOS> token.
  • Figure 2: Latent Refinement via Scale-Travel. We introduce scale-travel, a novel refinement strategy that leverages the multi-scale structure of VAR models. By reverting intermediate representations to a coarser scale and running generation without noise injection, clean and coherent details can be reconstructed. This process corrects visual artifacts and degradations while preserving the overall coarse structures.
  • Figure 3: Multi-Scale Encoding
  • Figure 4: Qualitative Comparisons. Infinity produces images with little variation and diversity (row 1). Injecting noise into the text-embedding increases diversity (row 2) but results in visual artifacts (red box). Applying our scale-travel refinement technique (row 3) fixes these visual artifacts (green box) while retaining diversity.
  • Figure 5: Pareto Fronts for Diversity-Quality Trade-Off on Infinity [han2024infinity] and Switti [voronov2024switti]. Each curve is obtained by varying hyperparameters for each method: noise scales for SOS and CADS, guidance schedules for CFG, and target scales for DiverseVAR (Ours). The default setting of each model is included as a reference. On both MJHQ-30K [li2024mjhq] and MS-COCO [lin2014mscoco], DiverseVAR consistently sets the Pareto frontier, demonstrating the best balance between diversity and quality in VAR.
  • ...and 15 more figures
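
The scale-travel idea in the Figure 2 and Figure 3 captions (encode a latent into multiple scales, keep only the coarse ones, then resume generation) can be sketched with a toy residual decomposition. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses average pooling and nearest-neighbour upsampling on a single-channel square latent, omits the quantization that real VAR tokenizers perform, and takes the generator as a hypothetical callback `generate_from`. All function names are invented for this example.

```python
import numpy as np


def downsample(x, size):
    """Average-pool a square (H, H) map to (size, size); H % size must be 0."""
    f = x.shape[0] // size
    return x.reshape(size, f, size, f).mean(axis=(1, 3))


def upsample(x, size):
    """Nearest-neighbour upsample a square map to (size, size)."""
    f = size // x.shape[0]
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)


def multiscale_encode(latent, scales):
    """Residual decomposition: each scale encodes what coarser scales missed."""
    full = latent.shape[0]
    residual = latent.copy()
    tokens = []
    for s in scales:
        t = downsample(residual, s)
        tokens.append(t)
        residual = residual - upsample(t, full)
    return tokens


def decode(tokens, full):
    """Sum the upsampled per-scale maps back into a full-resolution latent."""
    return sum(upsample(t, full) for t in tokens)


def scale_travel(latent, scales, target_idx, generate_from):
    """Revert to the scales up to `target_idx` and resume generation.

    `generate_from(coarse_tokens, scales)` is a hypothetical stand-in for a
    VAR model that regenerates the finer scales without noise injection.
    """
    coarse = multiscale_encode(latent, scales)[: target_idx + 1]
    return generate_from(coarse, scales)


# Usage with a dummy "generator" that simply decodes the coarse tokens.
rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 8))
scales = [1, 2, 4, 8]
refined = scale_travel(latent, scales, target_idx=1,
                       generate_from=lambda coarse, _: decode(coarse, 8))
```

In this toy setting the refined latent agrees with the original when both are pooled to the target scale, which mirrors the caption's claim that coarse structure is preserved while finer detail is regenerated. In this sketch, a lower `target_idx` discards more of the original latent.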