SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model

Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko

TL;DR

SoftCFG tackles the fading and misalignment issues of classifier-free guidance in visual autoregressive image generation by distributing uncertainty-weighted influence from past tokens across the generation horizon. It introduces token-wise confidence weights to perturb the unconditional value cache and couples this with Step Normalization to bound cumulative perturbations, achieving stable long-horizon generation without retraining. Empirically, SoftCFG improves image fidelity and alignment, achieving state-of-the-art FID on ImageNet $256\times256$ among autoregressive models and maintaining competitive perceptual quality metrics. The approach is model-agnostic, training-free, and computationally lightweight, offering a practical pathway to more coherent conditional AR generation with minimal overhead.

Abstract

Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet $256\times256$ among autoregressive models.
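
The two ingredients named in the abstract can be illustrated with a minimal sketch. The first function is the standard CFG logit combination. The other two show one possible way to scale an unconditional value cache by per-token confidence while keeping the total perturbation bounded. The function names, the array shapes, and the specific normalization (dividing confidences by their sum when it exceeds 1) are assumptions made for illustration; the abstract does not give the exact formulas.

```python
import numpy as np

def cfg_logits(z_cond, z_uncond, scale):
    """Standard CFG: extrapolate from unconditional toward conditional logits."""
    z_cond = np.asarray(z_cond, dtype=np.float64)
    z_uncond = np.asarray(z_uncond, dtype=np.float64)
    return z_uncond + scale * (z_cond - z_uncond)

def step_normalize(conf):
    """Assumed normalization: rescale confidences so they sum to at most 1."""
    conf = np.asarray(conf, dtype=np.float64)
    return conf / max(conf.sum(), 1.0)

def perturb_value_cache(values, conf):
    """Assumed perturbation: scale each cached value vector by (1 - weight).

    values: (num_tokens, dim) unconditional value-cache entries.
    conf:   (num_tokens,) confidence of each already generated token.
    """
    weights = step_normalize(conf)
    return np.asarray(values, dtype=np.float64) * (1.0 - weights)[:, None]

# Hypothetical toy inputs: four generated tokens with 2-dimensional values.
values = np.ones((4, 2))
conf = np.array([0.9, 0.5, 0.4, 0.2])
perturbed = perturb_value_cache(values, conf)
guided = cfg_logits([2.0, 0.0], [1.0, 1.0], scale=3.0)
```

Under this assumed normalization the weights sum to at most 1, so the summed change to the cache cannot exceed the largest value-vector norm no matter how many tokens have been generated. That is the kind of bound the abstract attributes to Step Normalization.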

Paper Structure

This paper contains 31 sections, 1 theorem, 26 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Let $f_\theta$ be $L_t$-Lipschitz with respect to its value-cache input at step $t$. Then the deviation of SoftCFG from vanilla CFG is bounded as where $\tilde{\mathbf{z}}^{\text{uncond,pert}}_t$ denotes the unconditional logits under step-normalized perturbation.

Figures (10)

  • Figure 2: Comparison of images generated by standard Classifier-Free Guidance (CFG) and our proposed SoftCFG on LuminaGPT-8B [xin2025luminamgpt]. Unlike CFG, which applies the same conditional offset regardless of generation history, SoftCFG adaptively incorporates uncertainty from the already generated content. As a result, SoftCFG effectively reduces unreasonable artifacts, such as motorcycles collapsing into tangled shapes, extra trunks emerging from nowhere, or redundant hands in humans. This demonstrates that by aligning guidance with generated content, SoftCFG yields more coherent and visually plausible generations.
  • Figure 3: Diminishing effect of classifier-free guidance (CFG) in the AR model Alitok-XL [wu2025alitok]. We plot the normalized entropy over generation steps. As generation progresses, the difference (i.e., the guidance signal, green line in the plot) between the baseline entropy (blue line) and the CFG-perturbed entropy (orange line) quickly vanishes. Here, a normalized entropy close to 1 indicates that guidance no longer provides an informative signal. Please refer to the appendix on normalized entropy for more details.
  • Figure 4: Illustration of the over-guidance phenomenon. When applying a large guidance strength, the model over-emphasizes certain words in the prompt (e.g., "banana"), leading to distorted generations. In this example, the model incorrectly maps the word "banana" to the elephant’s tusk, highlighting how excessive guidance strength can harm semantic alignment.
  • Figure 5: Heatmaps of token confidence overlaid on images generated by LuminamGPT2 [xin2025luminamgpt]. High-confidence regions align well with salient semantic structures (e.g., object parts), while low-confidence regions occur in ambiguous backgrounds, supporting the claim that high-confidence tokens can be effective guidance signals.
  • Figure 6: Two perturbation strategies for Visual AR models. (a) Unconditional Perturbation modifies the first class-conditional token regardless of the certainty score. (b) Uncertainty-guided Perturbation applies softer, weighted changes to all $\mathbf{V}$ cache entries, with strength $(1-\mathbf{P})$, offering stronger perturbation to high-confidence tokens.
  • ...and 5 more figures
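
The normalized entropy plotted in Figure 3 can be illustrated with a short sketch. The paper's own definition lives in its appendix and is not reproduced here, so the code below assumes the common convention of dividing the Shannon entropy of the next-token distribution by the log of the vocabulary size, which gives a value in $[0, 1]$. The function name and the toy logits are hypothetical.

```python
import numpy as np

def normalized_entropy(logits):
    """Shannon entropy of softmax(logits), divided by log(vocab_size).

    Assumed convention: 1.0 means a uniform (uninformative) distribution,
    values near 0.0 mean the distribution is sharply peaked.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Log-softmax computed stably by subtracting the maximum logit first.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=-1)
    return entropy / np.log(logits.shape[-1])

uniform = normalized_entropy(np.zeros(8))               # flat distribution
peaked = normalized_entropy([100.0, 0.0, 0.0, 0.0])     # one dominant token
```

With this convention, comparing the value computed from baseline logits with the value computed from guided logits at each decoding step would give curves of the kind the caption describes.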

Theorems & Definitions (1)

  • Proposition 1: Bounded Deviation of Step-Normalized SoftCFG