Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models
Ky Dan Nguyen, Hoang Lam Tran, Anh-Dung Dinh, Daochang Liu, Weidong Cai, Xiuying Wang, Chang Xu
TL;DR
The paper addresses misaligned guidance signals in scale-wise autoregressive (SwAR) image generation, showing that diffusion-style guidance is applied unevenly in SwAR models. It introduces Information-Grounding Guidance (IGG), which uses self-attention over guidance signals to emphasize semantically important regions and keep conditioning aligned with content, formalized as $\tilde{p}_\theta(s_k|c) = p_\theta(s_k) + f_k(s_k|c)\,p_\theta^\to(s_k|c)$. Across class-conditioned and text-to-image tasks, IGG consistently surpasses classifier-free guidance on FID, IS, CLIP score, GenEval, and related metrics, achieving state-of-the-art guidance performance for SwAR and diffusion backbones. The work also introduces evenness and divergence as interpretable metrics of guidance dynamics, shows that the best sampling quality coincides with a balance between the two, and highlights the importance of guidance schedule design for practical gains.
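To make the guidance rule in the TL;DR concrete, here is a minimal NumPy sketch that contrasts classifier-free guidance, which applies one scalar scale to every token, with a per-token weighting in the spirit of $f_k(s_k|c)$. It is an illustration under stated assumptions, not the authors' implementation: it treats $p_\theta^\to(s_k|c)$ as the difference between conditional and unconditional model outputs (logits here), and the function names and the attention-based weighting in `weights_from_attention` are hypothetical, since the summary does not specify how IGG computes its weights.

```python
import numpy as np

def cfg_logits(uncond, cond, scale):
    """Classifier-free guidance: the same scalar scale for every token."""
    return uncond + scale * (cond - uncond)

def token_weighted_logits(uncond, cond, weights):
    """Per-token guidance: weights has shape (num_tokens,)."""
    # Assumption: the guidance direction is the conditional/unconditional gap.
    direction = cond - uncond
    # Broadcast one weight per token across that token's vocabulary logits.
    return uncond + weights[:, None] * direction

def weights_from_attention(attn, base_scale):
    """Hypothetical weighting, not taken from the paper.

    attn is a (num_tokens, num_tokens) attention map whose rows sum to 1.
    Tokens that receive more attention get a larger guidance weight, and
    the weights are normalised so that their mean equals base_scale.
    """
    received = attn.mean(axis=0)  # average attention each token receives
    return base_scale * received / received.mean()

# Toy usage: 4 tokens, vocabulary of 8 entries.
rng = np.random.default_rng(0)
uncond = rng.normal(size=(4, 8))
cond = rng.normal(size=(4, 8))
attn = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (4, 1))  # token 0 dominates
weights = weights_from_attention(attn, base_scale=3.0)
guided = token_weighted_logits(uncond, cond, weights)
```

With a uniform attention map every weight equals `base_scale`, so the per-token rule reduces to ordinary classifier-free guidance; a skewed map shifts guidance strength toward the tokens that receive the most attention while keeping the average scale unchanged.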
Abstract
Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.
