Autoregressive Image Generation with Randomized Parallel Decoding
Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang
TL;DR
ARPG is a decoupled two-pass autoregressive framework that generates images in fully random token order with parallel decoding. It separates content representation learning (Pass-1) from position-guided token prediction (Pass-2): data-independent [MASK] queries attend to a shared content KV cache. This design preserves fidelity while raising throughput and lowering memory use relative to raster-order and other parallel AR methods. The approach supports zero-shot generalization (inpainting, outpainting, resolution expansion) as well as controllable and text-to-image generation. On ImageNet-1K $256\times256$ it reaches an FID of $1.83$ with $32$ sampling steps, with over $30\times$ faster inference and 75 percent lower memory consumption than representative recent autoregressive models at a similar scale. These results indicate that fully causal training with randomized parallel decoding is both feasible and advantageous for scalable, flexible visual synthesis, particularly where inference efficiency and task generalization matter.
Abstract
We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation. It addresses the inherent limitations of conventional raster-order approaches, whose sequential, predefined token generation order hinders inference efficiency and zero-shot generalization. Our key insight is that effective random-order modeling requires explicit guidance on the position of the next predicted token. To this end, we propose a novel decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot inference tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256×256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps, achieving over a 30 times speedup in inference and a 75 percent reduction in memory consumption compared to representative recent autoregressive models at a similar scale.
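The decoupling described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: there are no learned weights, the embeddings `pos_embed` and `tok_embed`, the dot-product readout, the even split of positions across steps, and the sizes `DIM` and `VOCAB` are all assumptions chosen for illustration. What it does show is the control flow: position-only queries attend to a shared key-value cache of already generated content, and every query in a step reads the same cache snapshot, so the predictions within a step are independent of one another.

```python
import math
import random

DIM = 8     # toy embedding width (assumption)
VOCAB = 16  # toy codebook size (assumption)

def pos_embed(position, dim=DIM):
    """Sinusoidal embedding of a target position; it depends on no image content."""
    return [math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(position / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def tok_embed(token, dim=DIM):
    """Toy deterministic content embedding for a token id (hypothetical)."""
    return [math.cos((token + 1) * (i + 1)) for i in range(dim)]

def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over a KV cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def random_parallel_schedule(num_tokens, num_steps, rng):
    """Shuffle all positions, then split them into num_steps groups."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    base, extra = divmod(num_tokens, num_steps)
    groups, start = [], 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups

def decode(num_tokens=256, num_steps=32, seed=0):
    """Generate token ids in random order, one group of positions per step."""
    rng = random.Random(seed)
    start = tok_embed(VOCAB)  # stand-in for a class/start embedding
    keys, values = [start], [start]
    image = [None] * num_tokens
    for group in random_parallel_schedule(num_tokens, num_steps, rng):
        # All queries in this step read the same cache snapshot, so these
        # attention calls are independent and could run concurrently.
        outputs = [attend(pos_embed(p), keys, values) for p in group]
        for p, out in zip(group, outputs):
            # Toy readout: pick the token whose embedding best matches the output.
            token = max(range(VOCAB),
                        key=lambda t: sum(o * e for o, e in zip(out, tok_embed(t))))
            image[p] = token
            # Cache entry combines generated content with its position.
            entry = [c + q for c, q in zip(tok_embed(token), pos_embed(p))]
            keys.append(entry)
            values.append(entry)
    return image
```

Calling `decode(num_tokens=256, num_steps=32)` fills 256 positions in 32 steps of 8 positions each, assuming a 256-token image grid. Because the queries carry only positions, the same loop could start from a cache pre-filled with known tokens, which is the intuition behind the zero-shot inpainting and outpainting use cases mentioned above.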
