Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
Xiaoxiao Ma, Feng Zhao, Pengyang Ling, Haibo Qiu, Zhixiang Wei, Hu Yu, Jie Huang, Zhixiong Zeng, Lin Ma
TL;DR
The paper tackles sampling inefficiencies in autoregressive (AR) image generation that stem from the low and spatially uneven information density of image tokens. It introduces entropy-informed decoding, which uses per-token entropy to drive a dynamic temperature and an entropy-aware speculative decoding mechanism, with extensions to mask-based and scale-wise AR models. Key contributions include a practical temperature mapping $T = T_0 e^{-\epsilon/\alpha} + \theta$ and an entropy-dependent acceptance rule for speculative decoding, which together improve image quality and reach near-lossless generation at about 85% of the inference cost of conventional acceleration methods (roughly a 15% reduction). Extensive experiments across four AR models and multiple datasets show higher generation fidelity and faster sampling, pointing to entropy as a robust signal for decoding in AR image generation. The approach is training-free and complementary to existing acceleration techniques, with potential for integration into training-time optimizations and broader multimodal generation systems.
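To make the temperature mapping concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $\epsilon$ denotes the Shannon entropy (in nats) of each token's predicted distribution, and the values chosen for $T_0$, $\alpha$, and $\theta$ are hypothetical placeholders, not settings reported by the paper. The function names are illustrative only.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_entropy(logits):
    # Shannon entropy (nats) of each token's predicted distribution.
    # Assumption: this is what the epsilon in the mapping refers to.
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def dynamic_temperature(entropy, t0=1.0, alpha=1.0, theta=0.5):
    # T = T0 * exp(-eps / alpha) + theta.
    # t0, alpha, theta defaults are hypothetical, chosen for illustration.
    return t0 * np.exp(-entropy / alpha) + theta

def sample_tokens(logits, rng, t0=1.0, alpha=1.0, theta=0.5):
    # logits: array of shape (num_tokens, vocab_size).
    eps = token_entropy(logits)
    temp = dynamic_temperature(eps, t0, alpha, theta)
    probs = softmax(logits / temp[:, None])
    # Inverse-CDF sampling, one draw per token position.
    cdf = probs.cumsum(axis=-1)
    u = rng.random(size=(probs.shape[0], 1))
    idx = (u > cdf).sum(axis=-1)
    # Guard against rounding pushing the index past the last entry.
    idx = np.minimum(idx, probs.shape[-1] - 1)
    return idx, temp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = 16
    peaked = np.zeros(vocab)
    peaked[0] = 100.0           # near-deterministic prediction
    uniform = np.zeros(vocab)   # maximally uncertain prediction
    logits = np.stack([peaked, uniform])
    tokens, temps = sample_tokens(logits, rng)
    print("temperatures:", temps)
    print("sampled tokens:", tokens)
```

As written, the mapping decreases monotonically from $T_0 + \theta$ at zero entropy towards $\theta$ as entropy grows. With the placeholder values above, a uniform distribution over 16 tokens has entropy $\ln 16$, giving $T = 1/16 + 0.5 = 0.5625$, while the near-deterministic row gets a temperature close to 1.5.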
Abstract
In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
