Guiding Visual Autoregressive Models through Spectrum Weakening

Chaoyang Wang, Tianmeng Yang, Jingdong Wang, Yunhai Tong

TL;DR

The paper tackles the limitations of classifier-free guidance (CFG) in visual autoregressive generation by introducing a training-free Spectrum Weakening Guidance (SWG). SWG creates a controllable weak model through spectrum selection in the channel dimension via a DFT-based mask, supplemented by two renormalization strategies to preserve energy and maintain numerical stability. The authors provide an information-theoretic rationale for why spectral masking reduces information in a controlled way and demonstrate strong, consistent improvements in both unconditional and conditional generation across discrete and continuous AR models (NOVA, Lumina-mGPT, RandAR) on COCO and ImageNet. The approach is architecture-preserving, training-free, and compatible with CFG, offering a flexible and interpretable mechanism to guide AR-based visual generation.
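The weakening step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice to keep the lowest-frequency bins, the per-token L2 renormalization, and the extrapolation formula for combining strong and weak predictions are all assumptions made for illustration, since this summary does not give the exact mask or the exact guidance rule.

```python
import numpy as np


def weaken_spectrum(h, keep_ratio=0.5, eps=1e-8):
    """Hypothetical sketch: mask the channel-wise spectrum of hidden states.

    h has shape (..., C); the DFT is taken along the last (channel) axis.
    Which bins the paper retains is not stated here, so keeping the lowest
    frequencies is an assumption.
    """
    h = np.asarray(h, dtype=np.float64)
    num_channels = h.shape[-1]

    # Real-input DFT along the channel axis (C // 2 + 1 frequency bins).
    spec = np.fft.rfft(h, axis=-1)
    n_bins = spec.shape[-1]

    # Binary mask that keeps only the first n_keep bins.
    n_keep = max(1, int(round(keep_ratio * n_bins)))
    mask = np.zeros(n_bins)
    mask[:n_keep] = 1.0

    # Back to the channel domain; irfft returns a real array.
    weak = np.fft.irfft(spec * mask, n=num_channels, axis=-1)

    # Assumed renormalization: rescale each vector to its original L2 norm
    # so that masking does not shrink the activation energy.
    orig_norm = np.linalg.norm(h, axis=-1, keepdims=True)
    weak_norm = np.linalg.norm(weak, axis=-1, keepdims=True)
    return weak * orig_norm / (weak_norm + eps)


def apply_guidance(strong, weak, omega_s=1.0):
    """Assumed weak-model guidance: push the strong prediction away from
    the weak one. omega_s = 0 returns the strong prediction unchanged."""
    strong = np.asarray(strong, dtype=np.float64)
    weak = np.asarray(weak, dtype=np.float64)
    return strong + omega_s * (strong - weak)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.standard_normal((4, 16))  # 4 tokens, 16 channels
    weakened = weaken_spectrum(hidden, keep_ratio=0.25)
    print(weakened.shape)
```

In this sketch `keep_ratio` plays the role of the knob that makes the weak model controllable: at 1.0 the mask keeps every bin and the input is returned essentially unchanged, while smaller values discard more of the spectrum.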

Abstract

Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of the spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
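The information argument in the abstract can be sketched with two standard facts about mutual information. This is an illustration, not the paper's statement or proof: the mask symbol $\mathbf{M}$ and the left-multiplication convention are assumptions.

```latex
% Assumed notation: P is an invertible spectral transform,
% M is a (non-invertible) selection mask that drops some spectral components.
\begin{align}
  I(\mathbf{x};\, \mathbf{P}\mathbf{Z}) &= I(\mathbf{x};\, \mathbf{Z})
    && \text{(invertible maps preserve mutual information)} \\
  I(\mathbf{x};\, \mathbf{M}\mathbf{P}\mathbf{Z}) &\le I(\mathbf{x};\, \mathbf{P}\mathbf{Z})
    && \text{(data processing inequality)}
\end{align}
```

Chaining the two gives $I(\mathbf{x};\, \mathbf{M}\mathbf{P}\mathbf{Z}) \le I(\mathbf{x};\, \mathbf{Z})$: under these assumptions, the transform alone loses nothing, and any reduction comes from the selection step.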

Paper Structure

This paper contains 13 sections, 1 theorem, 12 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $\mathbf{P}\in\mathbb{C}^{C\times C}$ be invertible. For any pair of random variables $(\mathbf{x},\mathbf{Z})$

Figures (10)

  • Figure 1: Visual results of Spectrum Weakening Guidance (SWG). (a) Lumina-mGPT synthesizes images from a null prompt without any guidance. (b) Lumina-mGPT synthesizes images using the specified prompts, which are shown as white text in the first row.
  • Figure 2: Visual results of SWG on NOVA. (a) Model synthesizes with null prompt. (b) The model generates images according to the specified prompt, as indicated by the white text on the first row. Spatial renormalization is used.
  • Figure 3: Visual results of SWG on RandAR. The left three columns show unconditional generations, while the right three columns show conditional results.
  • Figure 4: Quantitative evaluation of RandAR under different SWG guidance strength $\omega_s$. From left to right, the curves depict FID, IS, Precision, and Recall with respect to the guidance scale.
  • Figure 5: Compatibility between SWG ($\omega_s$) and CFG ($\omega_c$) on the ImageNet dataset. The heatmap shows FID scores, with darker colors indicating better performance.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 1: Information loss under spectral selection