Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang

TL;DR

GridAR tackles the challenge of scaling computation for visual autoregressive image generation by introducing a grid-based progressive generation strategy that explores multiple partial candidates per canvas region and anchors promising trajectories. A verifier-guided pruning step and a layout-aware prompt reformulation mechanism mitigate the lack of a global blueprint in raster-scan decoding, enabling more accurate adherence to complex prompts. The authors implement two reformulation strategies—three-way classifier-free guidance and prompt replacement—and demonstrate substantial gains over Best-of-N baselines in both text-to-image generation and image editing, without any additional training. This approach delivers a favorable cost-performance trade-off and broadens the practical potential of test-time scaling for autoregressive image synthesis.

Abstract

Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal for two reasons: they spend full-length computation on erroneous generation trajectories, and the raster-scan decoding scheme lacks a blueprint of the entire canvas, which limits scaling benefits because only a few prompt-aligned candidates are generated. To address these limitations, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, these components let GridAR achieve higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
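The generate, prune, and anchor loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the canvas is a plain list of integers standing in for image tokens, and `extend`, `verify`, `score`, and every parameter name are hypothetical placeholders for an AR decoder, a verifier, and a final ranking function. The fallback used when every candidate is rejected is likewise an assumption, and the layout-specified prompt reformulation step is omitted.

```python
import random
from typing import Callable, List

Canvas = List[int]  # toy stand-in for a sequence of image tokens


def progressive_generate(
    extend: Callable[[Canvas, int, random.Random], Canvas],
    verify: Callable[[Canvas], bool],
    score: Callable[[Canvas], float],
    n_candidates: int,
    n_stages: int,
    segment_len: int,
    rng: random.Random,
) -> Canvas:
    """Toy generate -> prune -> anchor loop (all names are hypothetical)."""
    anchors: List[Canvas] = [[]]  # start from an empty canvas
    for _ in range(n_stages):
        # Generate several partial candidates for the same next region,
        # each continuing from one of the currently fixed anchors.
        candidates = [
            extend(anchors[i % len(anchors)], segment_len, rng)
            for i in range(n_candidates)
        ]
        # Prune candidates the verifier judges infeasible.
        survivors = [c for c in candidates if verify(c)]
        # Assumption: if everything is rejected, keep the best-scoring one.
        anchors = survivors or [max(candidates, key=score)]
    return max(anchors, key=score)


def toy_extend(prefix: Canvas, k: int, rng: random.Random) -> Canvas:
    """Append k random binary 'tokens' to a copy of the prefix."""
    return prefix + [rng.randint(0, 1) for _ in range(k)]


def toy_verify(canvas: Canvas) -> bool:
    """Accept a partial canvas if at least half of its tokens are 1."""
    return sum(canvas) * 2 >= len(canvas)


def toy_score(canvas: Canvas) -> float:
    """Higher is better: count of 1 tokens."""
    return float(sum(canvas))


if __name__ == "__main__":
    result = progressive_generate(
        toy_extend, toy_verify, toy_score,
        n_candidates=4, n_stages=4, segment_len=4, rng=random.Random(0),
    )
    print(result, toy_score(result))
```

Note that this sketch keeps the per-stage budget fixed and redirects it toward surviving anchors. It does not model the cost reduction reported in the abstract, which depends on details of the method not given here.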

Paper Structure

This paper contains 24 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: A grid-partitioned progressive image generation framework (GridAR) for test-time scaling of visual AR models.
  • Figure 2: GridAR ($N=4$) achieves 14.4% higher image quality through effective test-time scaling, surpassing Best-of-$N$ ($N=8$).
  • Figure 3: Visualization of Grid-based Progressive Generation process in two cases: (a) first-stage rejection (top row), where all candidates are accepted in the second stage; (b) second-stage rejection (bottom row), where all candidates are accepted in the first stage.
  • Figure 4: Motivation of Prompt Reformulation. Success rate increases significantly with the number of trials when prompt reformulation incorporates a plan for generating lower tokens, rather than relying only on the tokens generated in the upper part.
  • Figure 5: Qualitative Results comparing single-generation outputs, Best-of-$N$ ($N=4$) outputs, and outputs obtained by applying GridAR ($N=4$) on text-to-image generation and image editing.
  • ...and 2 more figures