Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy

Xiaoxiao Ma, Feng Zhao, Pengyang Ling, Haibo Qiu, Zhixiang Wei, Hu Yu, Jie Huang, Zhixiong Zeng, Lin Ma

TL;DR

The paper tackles sampling inefficiencies in autoregressive (AR) image generation caused by uneven information density across image tokens. It introduces entropy-informed decoding, which uses per-token entropy to drive a dynamic temperature and an entropy-aware speculative decoding mechanism, with extensions to mask-based and scale-wise AR models. Key contributions include a practical temperature mapping $T= T_0 e^{-\epsilon/\alpha} + \theta$ and an entropy-dependent acceptance rule for speculation. Together they improve image quality and cut inference cost by roughly 15%, to about 85% of the cost of conventional acceleration methods. Experiments across four AR models and multiple datasets show stronger generation fidelity and faster sampling, highlighting entropy as a robust decoding signal for autoregressive image generation. The approach is training-free and complementary to existing acceleration techniques, with potential for integration into training-time optimizations and broader multimodal generation systems.
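The temperature mapping quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes $\epsilon$ is the Shannon entropy (in nats) of the token distribution at temperature 1, and the values chosen for `t0`, `alpha`, and `theta` are placeholders rather than settings from the paper. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dynamic_temperature(ent, t0=1.0, alpha=1.0, theta=0.5):
    """T = T0 * exp(-entropy / alpha) + theta (placeholder hyperparameters)."""
    return t0 * math.exp(-ent / alpha) + theta


def entropy_scaled_probs(logits, t0=1.0, alpha=1.0, theta=0.5):
    """Assumed pipeline: measure entropy at T=1, then re-scale the logits."""
    ent = entropy(softmax(logits))
    temp = dynamic_temperature(ent, t0, alpha, theta)
    return softmax(logits, temp), ent, temp


# A peaked distribution (low entropy) gets a higher temperature than a
# flat one (high entropy), matching the monotonic shape of the formula.
_, ent_peaked, temp_peaked = entropy_scaled_probs([10.0, 0.0, 0.0, 0.0])
_, ent_flat, temp_flat = entropy_scaled_probs([0.0, 0.0, 0.0, 0.0])
```

Because the mapping decreases monotonically in entropy, $\theta$ acts as a floor on the temperature and $T_0 + \theta$ as its ceiling, while $\alpha$ controls how quickly the temperature falls as entropy grows.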

Abstract

In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.

Paper Structure

This paper contains 37 sections, 10 equations, 25 figures, 7 tables.

Figures (25)

  • Figure 1: Top row: Our method generates images with finer details and better structure. Bottom row: Combined with existing acceleration methods, ours reduces inference cost by 15%. (Left two pairs are from LlamaGen sun2024llamagen; right from Lumina-mGPT liu2024lumina_mgpt. Inference steps and latency are reported.)
  • Figure 2: (a) Comparison of information density between image and text. Histograms of average frequency-domain embeddings from LlamaGen sun2024llamagen (image) and Qwen2 yang2024qwen2technicalreport (text) show the uneven spatial distribution in images, with a large share of low-frequency components. (b) Qualitative results under various configurations. High CFG (Classifier-Free Guidance) or low top-$K$ often harms fidelity, while lower CFG with higher top-$K$ improves fidelity but may reduce text-image consistency. (c) Quantitative evaluation of LlamaGen under different sampling settings.
  • Figure 3: (a) Entropy map during generation: complex regions exhibit higher entropy (more dispersed probabilities), while simpler areas show lower entropy. (b) Histogram of entropy distribution on COCO val2017 (from LlamaGen Stage II). (c) Varying temperature by entropy range affects FID and CLIP score: lower-entropy tokens benefit from higher temperatures, and vice versa.
  • Figure 4: During generation with the mask-based model bai2024meissonic, a large number of early steps (0$\sim$50) are allocated to computing tokens in simple regions, while only a few later steps (e.g., 50$\sim$63) are left for generating complex content. This often leads to degraded quality in the main visual subjects.
  • Figure 5: Visual comparison on a next-token model. Examples are from Lumina-mGPT; the proposed method ("Ours") maintains richer content while offering more accurate structure and finer details.
  • ...and 20 more figures