Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan
TL;DR
Hawk introduces spatial speculative decoding for autoregressive image generation by using dual-direction draft heads that reason over horizontal and vertical image contexts, expanding the draft sampling space while preserving the target distribution through a tree-structured verification process. By caching vertical predictions and integrating a 2D spatial sampling pool, Hawk achieves a 1.71× speedup on 768×768 image generation with Lumina-mGPT while maintaining image fidelity and CLIP alignment. The approach is supported by attention-sinking analyses showing reliance on spatial neighbors, and a dedicated study on vertical draft heads demonstrates additional gains in diversity and efficiency. Overall, Hawk offers a practical acceleration technique for AR image generation with low memory overhead and strong preservation of output quality, opening avenues for real-time or near-real-time applications.
Abstract
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. Two challenges stand in the way. First, the sampling space is significantly larger, which complicates the alignment between the draft and target model outputs. Second, existing methods make inadequate use of the two-dimensional spatial structure inherent in images, which limits the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71× speedup over standard AR models, while preserving both image fidelity and diversity.
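
For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic accept/reject rule used in standard speculative sampling, which is why a draft model can speed up decoding without changing the target distribution. It is not Hawk's implementation: the function names `speculative_accept` and `empirical_distribution`, the toy three-token vocabulary, and the single-token (non-tree) setting are all assumptions made for illustration.

```python
import random


def speculative_accept(p_target, q_draft, draft_token, rng):
    """Verify one drafted token against the target distribution.

    p_target, q_draft: lists of probabilities over the same vocabulary
    (hypothetical toy inputs, not outputs of any real model).
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p / q).
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the residual max(p - q, 0).
    # random.choices normalizes the weights internally.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    token = rng.choices(range(len(p_target)), weights=residual, k=1)[0]
    return token, False


def empirical_distribution(p_target, q_draft, n_samples, seed=0):
    """Draft from q, verify against p, and tally the emitted tokens."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n_samples):
        draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
        token, _ = speculative_accept(p_target, q_draft, draft, rng)
        counts[token] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy target distribution
    q = [0.2, 0.2, 0.6]  # toy (deliberately mismatched) draft distribution
    print(empirical_distribution(p, q, 20000))
```

Even with a badly mismatched draft distribution, the emitted tokens follow the target distribution up to sampling noise; a poor draft only lowers the acceptance rate and therefore the speedup. This is the sense in which a larger sampling space, as in image tokens, makes draft-target alignment the bottleneck.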
