Autoregressive Image Generation with Randomized Parallel Decoding

Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang

TL;DR

ARPG introduces a decoupled two-pass autoregressive framework that enables fully random-order image generation with parallel decoding. It separates content representation learning (Pass-1) from position-guided token prediction (Pass-2), and uses data-independent [MASK] queries that attend to a shared content KV cache. This lets ARPG reach high fidelity while improving throughput and reducing memory use relative to raster-order and other parallel AR methods. The approach supports zero-shot generalization (inpainting, outpainting, resolution expansion) as well as controllable and text-to-image generation. On ImageNet-1K $256\times256$, it attains an FID of $1.83$ with $32$ sampling steps, with over $30\times$ faster inference and a 75 percent reduction in memory consumption compared to representative recent autoregressive models at a similar scale. These results indicate that fully causal training with randomized parallel decoding is both feasible and advantageous for scalable, flexible visual synthesis.

Abstract

We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel decoupled decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot inference tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps, achieving over a 30 times speedup in inference and a 75 percent reduction in memory consumption compared to representative recent autoregressive models at a similar scale.
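
The decoupling described above (positions as queries, content as key-value pairs, several queries sharing one KV cache) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, and the helper names (`attend`, `parallel_decode_step`, `random_schedule`) and the even split of positions across steps are assumptions made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for ONE position query over the content KV cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def parallel_decode_step(position_queries, keys, values):
    """Decode several target positions in one step.

    Each query reads the same cache and never looks at the other queries,
    so the per-query results do not depend on what else is in the batch.
    """
    return [attend(q, keys, values) for q in position_queries]


def random_schedule(num_tokens, num_steps, rng):
    """Hypothetical schedule: shuffle all positions, then split them evenly into steps."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    size = math.ceil(num_tokens / num_steps)
    return [order[i:i + size] for i in range(0, num_tokens, size)]


rng = random.Random(0)
dim = 4
# Content cache for three already-generated tokens (random stand-ins for Pass-1 outputs).
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
# Two position queries (random stand-ins for target-aware [MASK] queries).
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

batched = parallel_decode_step(queries, keys, values)
one_by_one = [attend(q, keys, values) for q in queries]
schedule = random_schedule(num_tokens=16, num_steps=4, rng=rng)
```

Because a query only reads the cache, decoding positions together or one at a time gives identical outputs in this sketch, which is the property that makes a shared KV cache usable for parallel steps.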

Paper Structure

This paper contains 45 sections, 10 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Methods for representing the position of the next token. MAR [chang2022maskgit, li2024autoregressive] indicates the position by masking out image tokens; Block-AR models [tian2024visual, wang2025parallelized, he2025nar] use predefined positions; RandAR [pang2025randar] intersperses position tokens throughout the sequence; and our ARPG integrates the position as a query in a cross-attention mechanism.
  • Figure 2: Analysis of attention scores. Normalized attention maps from multiple distinct heads in the final layer of RandAR [pang2025randar]. The maps, partitioned by token type (masked vs. unmasked), reveal that attention weights are predominantly concentrated on unmasked tokens.
  • Figure 3: Architecture. The first decoder extracts representations of image tokens. The second decoder uses target-aware [MASK] tokens as queries that attend to key-value pairs from the output of the first decoder. Teacher-forcing training is performed under causal attention. Parallel decoding is achieved by inputting multiple queries in a single step, with each query independently attending to the existing KV cache (values omitted for clarity).
  • Figure 4: Implementation details. (a) Conditional inputs provide the queries. (b) For zero-shot inpainting, known regions are pre-filled in Pass-1, while masked regions are generated in Pass-2.
  • Figure 5: Generation samples. ARPG can efficiently generate high-fidelity images with 64 steps.
  • ...and 7 more figures

Theorems & Definitions (1)

  • proof