Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models

Ky Dan Nguyen, Hoang Lam Tran, Anh-Dung Dinh, Daochang Liu, Weidong Cai, Xiuying Wang, Chang Xu

TL;DR

The paper tackles misalignment of guidance signals in scale-wise autoregressive (SwAR) image generation by showing that diffusion-like guidance is uneven in SwAR. It introduces Information-Grounding Guidance (IGG), which uses self-attention over guidance signals to emphasize semantically important regions and keep conditioning aligned with content, formalized as $\tilde{p}_\theta(s_k|c) = p_\theta(s_k) + f_k(s_k|c)\,p_\theta^\to(s_k|c)$. Across class-conditioned and text-to-image tasks, IGG consistently surpasses classifier-free guidance on metrics like FID, IS, CLIP, GenEval, and related scores, achieving state-of-the-art guidance performance for SwAR and diffusion backbones. The work also introduces evenness and divergence as interpretable guidance-dynamics metrics, showing that optimal sampling quality correlates with a balance between them, and highlights the importance of the guidance schedule design for practical gains.
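The formula above can be read as classifier-free guidance with a per-token weight in place of a single global scale. The sketch below illustrates that reading in NumPy and is not the authors' implementation. It assumes that $p_\theta^\to(s_k|c)$ is the conditional-minus-unconditional guidance direction and that $f_k$ is a non-negative per-token weight; this summary spells out neither. The function `token_weights` is a hypothetical stand-in for $f_k$ that scores each token by the self-attention it receives from the other tokens' guidance directions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cfg_logits(uncond, cond, scale):
    """Standard classifier-free guidance: one scalar scale shared by all tokens."""
    return uncond + scale * (cond - uncond)

def token_weights(direction):
    """Hypothetical stand-in for f_k (an assumption, not the paper's definition).

    direction: (n_tokens, vocab) array of guidance directions.
    Returns per-token weights with mean 1, so that uniform attention
    reduces to plain classifier-free guidance.
    """
    n, d = direction.shape
    # Self-attention over the guidance signal: each token attends to all tokens.
    attn = softmax(direction @ direction.T / np.sqrt(d), axis=-1)  # (n, n)
    received = attn.mean(axis=0)  # attention each token receives; sums to 1
    return received * n

def weighted_guidance_logits(uncond, cond, scale):
    """Guidance whose strength varies per token according to token_weights."""
    direction = cond - uncond
    w = token_weights(direction)
    return uncond + scale * w[:, None] * direction

# Toy usage: 6 tokens at one scale, vocabulary of 10 codes.
rng = np.random.default_rng(0)
uncond = rng.normal(size=(6, 10))
cond = rng.normal(size=(6, 10))
guided = weighted_guidance_logits(uncond, cond, scale=2.0)
```

Normalising the weights to mean 1 keeps the overall guidance strength comparable to `cfg_logits` at the same `scale`; only its distribution across tokens changes.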

Abstract

Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.

Paper Structure

This paper contains 17 sections, 7 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison between classifier-free guidance (top) and our method (bottom) on ImageNet $512\times512$ class-conditioned generation (class: cock), with VAR \cite{tian_visual_2024} as the backbone model. Each column corresponds to a sampling step. Each heat map depicts the distribution of guidance on tokens at the respective sampling step, ranging from purple (weak guidance) to yellow (strong guidance). Blue and red scores indicate the evenness and divergence at each step, where $\uparrow$/$\downarrow$ indicates that higher/lower is better (see Section \ref{sec:motivation} for further detail). Our method improves upon classifier-free guidance by concentrating guidance towards regions of foreground objects.
  • Figure 2: Distribution of guidance throughout the sampling process of EDM2 \cite{karras_analyzing_2024} (top) and VAR \cite{tian_visual_2024} (bottom) on ImageNet $512\times512$ class-conditioned generation (class: sports car). For the sake of comparison, the original number of sampling steps of EDM2 (32 steps) has been modified to match VAR. Sampling steps are labelled with their respective evenness and divergence scores, with opacity corresponding to relative contributions to the weighted mean score. While EDM2 exhibits guidance signals that are sharp and consistently aligned to foreground objects, VAR exhibits guidance signals that are poorly aligned and become progressively fainter. Additional examples can be found in Appendix \ref{app:diff-vs-swar-extra}, and a visualisation of guidance in EDM2 across the full 32 sampling steps in Appendix \ref{app:diff-guidance}.
  • Figure 3: Example $1024\times1024$ generations of Switti under three guidance schemes: no guidance, CFG, and $\textsc{IGG}$ (ours). Without $\textsc{IGG}$, Switti sampled with CFG or without guidance has a higher chance of generating failure features, in around one in four samples.
  • Figure 4: Analysing the relationships between various metric scores attained by VAR-$d30$-$\textsc{IGG}$. From left to right: effect of changing guidance scales on reported evenness and divergence scores, where dashed and solid lines depict raw and scaled scores respectively; correspondence between FID and scaled evenness and divergence scores; and FID-IS trade-off curve.
  • Figure 5: Additional comparisons of guidance signals in EDM2 \cite{karras_analyzing_2024} (top) and VAR \cite{tian_visual_2024} (bottom).
  • ...and 2 more figures