SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model
Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko
TL;DR
SoftCFG tackles the fading and misalignment issues of classifier-free guidance (CFG) in visual autoregressive (AR) image generation by distributing uncertainty-weighted influence from past tokens across the generation horizon. It introduces token-wise confidence weights to perturb the unconditional value cache and couples this with Step Normalization to bound cumulative perturbations, which stabilizes long-horizon generation without retraining. Empirically, SoftCFG improves image fidelity and alignment, achieving state-of-the-art FID on ImageNet $256\times256$ among autoregressive models while maintaining competitive perceptual quality metrics. The approach is model-agnostic, training-free, and computationally lightweight, offering a practical pathway to more coherent conditional AR generation with minimal overhead.
Abstract
Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet $256\times256$ among autoregressive models.
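
To make the ingredients named above concrete, the following is a minimal NumPy sketch, not the paper's implementation. Only `cfg_logits` is the standard CFG combination in logit space. The helpers `token_confidence`, `step_normalized_weights`, and `perturb_uncond_values` are hypothetical names, and the specific rules they apply (using the sampled token's probability as its certainty, rescaling weights so they sum to at most 1, and scaling cached values by one minus the weight) are illustrative assumptions, since the abstract does not state the exact formulas.

```python
import numpy as np

def cfg_logits(cond, uncond, scale):
    """Standard CFG: extrapolate from unconditional toward conditional logits."""
    return uncond + scale * (cond - uncond)

def token_confidence(probs, token_ids):
    """ASSUMPTION: certainty of a generated token = probability the model gave it.

    probs: (T, V) per-step distributions, token_ids: (T,) sampled token indices.
    """
    return probs[np.arange(len(token_ids)), token_ids]

def step_normalized_weights(confidence):
    """ASSUMPTION: use confidence as perturbation strength and rescale so the
    weights sum to at most 1, keeping the total perturbation bounded no matter
    how many tokens have been generated."""
    total = confidence.sum()
    return confidence / total if total > 1.0 else confidence

def perturb_uncond_values(values, weights):
    """ASSUMPTION: damp each cached value vector of the unconditional branch
    by its token's weight. values: (T, D), weights: (T,)."""
    return values * (1.0 - weights)[:, None]
```

In this sketch, the perturbed values would replace the unconditional branch's value cache before the next decoding step, and the resulting unconditional logits would then be combined with the conditional ones through `cfg_logits`. The cap on the summed weights is one simple way to keep the perturbation from growing with sequence length.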
