Go with Your Gut: Scaling Confidence for Autoregressive Image Generation

Harold Haodong Chen, Xianfeng Wu, Wen-Jie Shu, Rongjin Guo, Disen Lan, Harry Yang, Ying-Cong Chen

TL;DR

This work tackles the challenge of applying test-time scaling to autoregressive, next-token prediction for image generation. It presents ScalingAR, a two-level framework that uses token-entropy-based intrinsic signals and a dual-channel confidence profile to guide adaptive termination and dynamic conditioning without early decoding or external rewards. The approach yields substantial gains in image quality on GenEval and TIIF-Bench, along with significant token savings and improved robustness in difficult scenarios. By enabling phase-aware, confidence-driven scaling for AR image generation, ScalingAR offers a practical and scalable pathway toward more reliable token-based image synthesis.

Abstract

Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
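
The abstract names token entropy as the intrinsic signal and describes terminating low-confidence trajectories. As a rough illustration only, the sketch below computes a normalized entropy-based confidence per token, smooths it into a running state, and applies a threshold test. The function names, the exponential moving average, and the `decay`, `threshold`, and `warmup` values are all assumptions made for this example; the abstract does not specify the paper's actual confidence profile or termination rule, and the conditional channel and guidance scheduling are not modeled here.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_confidence(probs: Sequence[float]) -> float:
    """Map entropy to [0, 1]: 1.0 for a one-hot distribution, 0.0 for uniform.

    Hypothetical definition chosen for illustration, not the paper's formula.
    """
    vocab_size = len(probs)
    if vocab_size < 2:
        return 1.0
    return 1.0 - token_entropy(probs) / math.log(vocab_size)


def running_confidence(step_probs: Sequence[Sequence[float]],
                       decay: float = 0.9) -> List[float]:
    """Exponential moving average of per-token confidence (assumed smoothing)."""
    trace: List[float] = []
    state = None
    for probs in step_probs:
        conf = normalized_confidence(probs)
        state = conf if state is None else decay * state + (1.0 - decay) * conf
        trace.append(state)
    return trace


def should_terminate(trace: Sequence[float],
                     threshold: float = 0.2,
                     warmup: int = 4) -> bool:
    """Stop a trajectory once its smoothed confidence falls below a threshold.

    `threshold` and `warmup` are illustrative values, not taken from the paper.
    """
    return len(trace) > warmup and trace[-1] < threshold


if __name__ == "__main__":
    peaked = [[0.97, 0.01, 0.01, 0.01]] * 6   # confident trajectory
    flat = [[0.25, 0.25, 0.25, 0.25]] * 6     # maximally uncertain trajectory
    print(should_terminate(running_confidence(peaked)))
    print(should_terminate(running_confidence(flat)))
```

Because the signal comes from the model's own output distributions, a check like this needs neither decoded intermediate images nor an external reward model, which is the property the abstract emphasizes.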

Paper Structure

This paper contains 40 sections, 15 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: (Top) ScalingAR significantly improves the quality of autoregressive image generation. Detailed prompts are provided in the Appendix. (Bottom Left) The token confidence trajectory over the generation process. (Bottom Right) Performance comparison of ScalingAR on TIIF-Bench with classic test-time scaling strategies, i.e., Importance Sampling (IS) and Best-of-N (BoN).
  • Figure 2: (a) The next-scale prediction paradigm generates multi-scale token maps from coarse to fine. (b) The next-token prediction paradigm predicts image tokens sequentially, one at a time. (c) Illustration of Best-of-N sampling, which generates multiple candidates and selects the best via voting or scoring. (d) Overview of our proposed ScalingAR, highlighting its ability to leverage token entropy to early-stop low-confidence samples and identify winning samples without the need for additional reward models.
  • Figure 3: (Left) Confidence distribution of ScalingAR on GenEval and TIIF-Bench. (Right) Illustration of the trade-off between visual quality and semantic alignment with fixed Classifier-Free Guidance (CFG) in AR image generation. 1st: A 35 mm photo of a cityscape resembling Moscow floating in the sky on flying islands. 2nd: The colorful hot air balloon floated near the dark grey storm clouds.
  • Figure 4: Qualitative results of ScalingAR. More results on AR-GRPO are provided in the Appendix.
  • Figure 5: (Left) User study across five dimensions: overall preference, aesthetic quality, realism fidelity, semantic alignment, and attribute binding. (Middle) Visual token consumption of ScalingAR vs. baselines on TIIF-Bench. (Right) Scaling width and depth across sample number and token length.
  • ...and 5 more figures