Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan

TL;DR

Hawk introduces spatial speculative decoding for autoregressive image generation by using dual-direction draft heads that reason over horizontal and vertical image contexts, expanding the draft sampling space while preserving the target distribution through a tree-structured verification process. By caching vertical predictions and integrating a 2D spatial sampling pool, Hawk achieves a 1.71× speedup on 768×768 image generation with Lumina-mGPT while maintaining image fidelity and CLIP alignment. The approach is supported by attention-sinking analyses showing reliance on spatial neighbors, and a dedicated study on vertical draft heads demonstrates additional gains in diversity and efficiency. Overall, Hawk offers a practical acceleration technique for AR image generation with low memory overhead and strong preservation of output quality, opening avenues for real-time or near-real-time applications.

Abstract

Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges are twofold. First, the sampling space is significantly larger, which complicates the alignment between the draft and target model outputs. Second, existing approaches make inadequate use of the two-dimensional spatial structure inherent in images, which limits the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71× speedup over standard AR models, while preserving both image fidelity and diversity.
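
For readers unfamiliar with speculative decoding, the following minimal sketch shows the standard accept/reject rule that lets a draft model speed up sampling without changing the target model's output distribution. It is the generic rule from the speculative sampling literature, not Hawk's own implementation; the function name `verify_draft_token` and the list-based distributions are assumptions made for illustration.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng=random):
    """Accept or replace one drafted token (hypothetical helper).

    token    -- index sampled from the draft distribution q_draft
    p_target -- target-model probabilities over the vocabulary
    q_draft  -- draft-model probabilities over the same vocabulary
    Returns (final_token, accepted_flag).
    """
    # Accept the drafted token with probability min(1, p / q).
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    resampled = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return resampled, False

# When draft and target agree exactly, every drafted token is accepted.
same = [0.25, 0.75]
print(verify_draft_token(1, same, same))
# When the target assigns zero mass to the drafted token, it is always replaced.
print(verify_draft_token(0, [0.0, 1.0], [1.0, 0.0]))
```

The larger the gap between the draft and target distributions, the more often the rejection branch fires, which is why the larger sampling space of image tokens makes a well-aligned draft harder to obtain.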

Paper Structure

This paper contains 15 sections, 13 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The difference in average attention logits between image generation, shown for Lumina-mGPT in 1D raster order (top), and text generation, shown for LLaMA 2 (bottom). The average attention logits of the image generation model exhibit a strong dependency on spatially neighboring points. The red box highlights the previous row in the image generation process.
  • Figure 2: An overview of the Hawk method. During each iteration of inference, the draft head generates horizontal and vertical speculations. The vertical speculations are stored in the Speculation Cache for reuse when subsequent rows are processed. Meanwhile, the horizontal speculations are combined with the previously cached vertical speculations to form the speculation sampling pool. From this pool, tree decoding candidates are generated, followed by a verification step akin to tree speculative decoding.
  • Figure 3: Quality evaluation of the Hawk method. For each image type, the images, from left to right, are generated by the base model, Medusa, Hawk with vertical heads, and Hawk with spatial heads. The Hawk method maintains the performance of the baseline model while enhancing inference speed.
  • Figure 4: The training loss of draft heads at different locations relative to the current decoding point. Vertical heads experience relatively less performance decay as the speculation depth increases.
  • Figure 5: The KL divergence between the vertical and horizontal draft heads during speculative decoding for the given prompt. The difference between the vertical and horizontal draft heads is more pronounced when generating complex areas of the image.
  • ...and 4 more figures
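
The flow described in the Figure 2 caption (horizontal speculations merged with cached vertical speculations into a sampling pool, then expanded into candidates and verified) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions: the names `build_pool`, `candidate_paths`, and `verify`, the top-k selection, and the greedy exact-match verification are all hypothetical simplifications, whereas the paper's verification is described as preserving the target distribution.

```python
from itertools import product

def top_k(probs, k):
    """Indices of the k most probable tokens."""
    return sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]

def build_pool(horizontal_probs, cached_vertical, k=2):
    """Merge horizontal top-k tokens with cached vertical guesses per position.

    horizontal_probs -- one probability list per upcoming position in the row
    cached_vertical  -- token lists saved while decoding the row above (assumed)
    """
    pool = []
    for probs, vertical in zip(horizontal_probs, cached_vertical):
        candidates = top_k(probs, k)
        for tok in vertical:
            if tok not in candidates:
                candidates.append(tok)
        pool.append(candidates)
    return pool

def candidate_paths(pool):
    """Expand the pool into every path of the candidate tree."""
    return [list(path) for path in product(*pool)]

def verify(paths, target_next):
    """Greedy stand-in for verification: keep the longest prefix matching the target.

    target_next -- callable mapping an accepted prefix (tuple) to the target's token
    """
    best = []
    for path in paths:
        accepted = []
        for tok in path:
            if target_next(tuple(accepted)) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

# Toy run: the horizontal heads alone would miss both tokens,
# but the cached vertical guesses supply them.
pool = build_pool([[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]], [[0], [1]], k=1)
target = {(): 0, (0,): 1}
print(pool)
print(verify(candidate_paths(pool), lambda prefix: target.get(prefix)))
```

In this toy run the two tokens accepted in a single step both come from the vertical cache, which is the intuition behind drawing candidates from a 2D pool rather than from the horizontal direction alone.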