SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

Youngwoo Shin, Jiwan Hur, Junmo Kim

TL;DR

This work tackles train-inference mismatch in visual autoregressive models that generate images in a coarse-to-fine, multi-scale fashion. It introduces Scaled Spatial Guidance (SSG), a training-free inference-time scheme that steers each step toward adding high-frequency, scale-specific information by updating residual logits with a frequency-domain prior constructed via Discrete Spatial Enhancement (DSE). The method yields a closed-form update $\ell_k^{\text{SSG}} = \ell_k + \beta_k(\ell_k - \ell_{\text{prior}})$ and is applicable across discrete token architectures, improving fidelity and diversity with negligible latency. Empirical results show consistent gains across different VAR scales and tokenizations, improved FID/IS trade-offs, and robustness to different conditioning modalities. The approach is simple, model-agnostic, and opens the door to training-free improvements in other hierarchical generative models.
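
As a minimal sketch of the closed-form update above, the snippet below applies $\ell_k^{\text{SSG}} = \ell_k + \beta_k(\ell_k - \ell_{\text{prior}})$ to a toy set of logits with NumPy. The function names `ssg_update` and `softmax`, the toy values, and the choice of $\beta_k = 0.5$ are illustrative assumptions, not taken from the authors' code; in particular, how the prior logits are produced (via DSE) is not shown here.

```python
import numpy as np

def ssg_update(logits, prior_logits, beta):
    """Return logits + beta * (logits - prior_logits), applied elementwise.

    `logits` plays the role of l_k, `prior_logits` the role of l_prior,
    and `beta` the role of the guidance scale beta_k (hypothetical names).
    """
    logits = np.asarray(logits, dtype=np.float64)
    prior_logits = np.asarray(prior_logits, dtype=np.float64)
    return logits + beta * (logits - prior_logits)

def softmax(x, axis=-1):
    """Numerically stable softmax, used only to inspect the result."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Toy example: one token position, vocabulary of three codes.
logits = np.array([[2.0, 1.0, 0.0]])   # current-scale residual logits
prior = np.array([[1.0, 1.0, 1.0]])    # stand-in for the coarser prior
guided = ssg_update(logits, prior, beta=0.5)
probs = softmax(guided)
```

With `beta=0` the update returns the original logits unchanged; larger values push the distribution further away from whatever the prior already explains.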

Abstract

Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
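
To make the idea of isolating a high-frequency residual from a coarser prior more concrete, here is a generic frequency-domain split written with NumPy. It is not the paper's DSE procedure, whose details are not given in this summary; the function `split_frequencies`, the square low-pass mask, and the `cutoff` parameter are all assumptions chosen for illustration.

```python
import numpy as np

def split_frequencies(x, cutoff):
    """Split a 2-D array into a low-frequency part and a high-frequency residual.

    `cutoff` is in cycles per sample (0 to 0.5). Frequencies whose horizontal
    and vertical components are both within the cutoff go to the low part;
    everything else is left in the residual.
    """
    fy = np.fft.fftfreq(x.shape[0])[:, None]
    fx = np.fft.fftfreq(x.shape[1])[None, :]
    mask = (np.abs(fy) <= cutoff) & (np.abs(fx) <= cutoff)
    low = np.fft.ifft2(np.fft.fft2(x) * mask).real
    high = x - low
    return low, high

# Toy map: a constant background plus a checkerboard at the highest frequency.
rows, cols = np.indices((8, 8))
checker = np.where((rows + cols) % 2 == 0, 1.0, -1.0)
x = 3.0 + checker

low, high = split_frequencies(x, cutoff=0.25)
```

In this toy case the low part recovers the constant background and the residual recovers the checkerboard, so the two parts sum back to the input exactly.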

Paper Structure (47 sections, 28 equations, 15 figures, 8 tables, 2 algorithms)

Figures (15)

  • Figure 1: SSG provides a training-free generation quality improvement for next-scale prediction models at negligible cost, yielding sharper detail, fewer artifacts, and preserved global coherence. Full input prompts and model specifications are given in the appendix.
  • Figure 2: Impact of SSG on Image Completion (VAR-d30). (Left) By amplifying the semantic residual, SSG enables the model to accurately reconstruct high-frequency details like the bird's beak (red box), unlike the baseline. (Right) Consistently better LPIPS substantiates this improvement.
  • Figure 3: Overview of a VAR-structured model with our Scaled Spatial Guidance (SSG) module. At each step, the autoregressive transformer predicts residual logits, which SSG refines by using a DSE-enhanced prior to isolate and amplify the high-frequency semantic residual before sampling.
  • Figure 4: Frequency-Domain Refinement and Performance. (a) Analysis of the $\Delta$ log magnitude of Fourier-transformed latent embeddings. SSG redistributes the spectral energy by suppressing redundant low frequencies while selectively boosting the essential high-frequency energy beyond the Nyquist frequency (red line). (b) SSG achieves a consistently better FID vs. IS trade-off across sampling temperatures, indicating an improved quality-diversity profile. See the full-scale trade-off figure for the graph over all evaluated sampling temperatures.
  • Figure 5: The trade-off between FID and IS of the guidance parameter $\beta_k$. The curve illustrates that optimizing solely for IS can be detrimental to the generation quality as measured by FID.
  • ...and 10 more figures