Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

TL;DR

Two key techniques are introduced: Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization, and Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality.

Abstract

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing work has tried to accelerate generation by shifting from next-patch to multi-patch prediction, but has achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256$\times$256 res.) and from 1024 to 48 (512$\times$512 res.) without compromising quality on ImageNet class-conditional generation, and achieve at least 3.4$\times$ lower latency than previous parallelized autoregressive models.
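
The idea behind Locality-aware Generation Ordering can be illustrated with a small greedy sketch. It is not the paper's algorithm, which is not reproduced in this summary. It only shows one plausible way to form groups whose members are far from each other (fewer intra-group dependencies) but close to tokens that were already generated (more contextual support). The function name `locality_aware_order`, the `repulsion` distance, and the group-size schedule are all hypothetical.

```python
import math
import random


def locality_aware_order(side, group_sizes, repulsion=2.0, seed=0):
    """Greedy sketch: split a side x side token grid into decoding groups.

    Hypothetical heuristic, not the paper's schedule: prefer positions close
    to already generated tokens, and keep members of the same group at least
    `repulsion` grid units apart when possible.
    """
    rng = random.Random(seed)
    remaining = [(r, c) for r in range(side) for c in range(side)]
    generated = []
    groups = []
    for size in group_sizes:
        if not remaining:
            break

        def support(p):
            # Distance to the nearest generated token; smaller = more context.
            if not generated:
                return 0.0
            return min(math.dist(p, g) for g in generated)

        rng.shuffle(remaining)  # random tie-breaking before the stable sort
        candidates = sorted(remaining, key=support)

        group = []
        # First pass: well-supported positions that are far from group members.
        for p in candidates:
            if len(group) == size:
                break
            if all(math.dist(p, q) >= repulsion for q in group):
                group.append(p)
        # Second pass: relax the distance constraint if the group is not full.
        chosen = set(group)
        for p in candidates:
            if len(group) == size:
                break
            if p not in chosen:
                group.append(p)
                chosen.add(p)

        remaining = [p for p in remaining if p not in chosen]
        generated.extend(group)
        groups.append(group)
    return groups


# Toy example: a 4 x 4 grid decoded in 5 steps instead of 16.
groups = locality_aware_order(side=4, group_sizes=[1, 2, 3, 4, 6])
```

In this toy run, each inner list holds the grid positions decoded together in one step. The group sizes grow over time because later steps have more generated context to rely on; that growth pattern is an assumption of the sketch, not a schedule taken from the paper.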

Paper Structure

This paper contains 28 sections, 3 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Performance comparison among parallelized autoregressive models on ImageNet 256$\times$256. We significantly reduce the generation steps and achieve at least 3.4$\times$ lower latency compared with previous models.
  • Figure 2: Visualization of attention maps in the LlamaGen-1.4B model. There is strong spatial locality, as the attention of a decoding token is concentrated on nearby spatial tokens. LlamaGen encodes images into 24 $\times$ 24 tokens, where a token that is 24 positions earlier in the attention map corresponds to the token directly above it in the 2D grid.
  • Figure 3: Raster Order vs. Flexible Parallelized Autoregressive Modeling. (a) In raster order, each token simultaneously provides context and predicts the next token, restricting flexibility and efficiency. (b) Our approach decouples these roles: previously generated tokens supply context, while position query tokens drive parallel generation at arbitrary target positions. This separation enables both flexible order and efficient parallelization.
  • Figure 4: Illustration of the training attention mask. Context Attention allows subsequent tokens to attend to the context tokens causally. Query Attention ensures mutual visibility among the position query tokens within the same step, and prevents any subsequent tokens from attending to the query tokens. For example, image token 4 can be attended to by all subsequent tokens, including image tokens and position query tokens, to provide context information. The two position query tokens $P_3$ and $P_5$ in the same generation step attend to the condition, to the image token 4, and to each other, while ignoring the earlier query $P_4$.
  • Figure 5: Illustration of the inference attention mask. Encoding with image tokens and Decoding with position query tokens can be fused into a single step. Taking step 2 in Figure 3 (b) as the example, it simultaneously encodes the previously generated image tokens 3 and 5 to update the KV-cache and decodes the desired image tokens 1, 2, and 6 in parallel.
  • ...and 9 more figures
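
The training attention mask described in the Figure 4 caption can be written down as a small sketch. The two rules come from the caption: context tokens are visible causally to every later token, and position query tokens are visible only to the queries of the same step. The sequence layout used here (condition first, then for each step the position queries followed by the image tokens they produce) and the helper name `build_training_mask` are assumptions made for illustration.

```python
def build_training_mask(tokens):
    """Build a boolean attention mask from a token layout.

    tokens: list of (kind, step) in sequence order, where kind is "ctx"
    (condition or image token) or "query" (position query token).
    Returns mask[i][j] == True when token i may attend to token j.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (kind_i, step_i) in enumerate(tokens):
        for j, (kind_j, step_j) in enumerate(tokens):
            if kind_j == "ctx":
                # Context Attention: causal visibility of context tokens.
                mask[i][j] = j <= i
            else:
                # Query Attention: queries are visible only to the queries
                # of the same generation step (including themselves).
                mask[i][j] = kind_i == "query" and step_i == step_j
    return mask


# Assumed layout for the caption's example.
tokens = [
    ("ctx", 0),    # 0: condition
    ("query", 1),  # 1: P_4
    ("ctx", 1),    # 2: image token 4
    ("query", 2),  # 3: P_3
    ("query", 2),  # 4: P_5
    ("ctx", 2),    # 5: image token 3
    ("ctx", 2),    # 6: image token 5
]
mask = build_training_mask(tokens)
```

Under this layout, the rows for $P_3$ and $P_5$ allow attention to the condition, to image token 4, and to each other, and block the earlier query $P_4$, matching the example in the caption. No image token row allows attention to any query column.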