
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

Han Song, Yucheng Zhou, Jianbing Shen, Yu Cheng

Abstract

Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
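To make the gating rule concrete, below is a minimal PyTorch sketch of the entropy-guided update described above. The thresholds `tau_low` and `tau_high` and the bonus weight `beta` are illustrative placeholders rather than the paper's values, and the surrogate loss is a simplified REINFORCE-style stand-in for the full clipped GRPO objective.

```python
import torch

def eg_grpo_token_loss(logprobs, entropies, advantage,
                       tau_low=0.5, tau_high=2.0, beta=0.01):
    """Entropy-guided token weighting, simplified sketch.

    logprobs:  (T,) log-probs of the sampled tokens under the current policy
    entropies: (T,) per-token policy entropies, computed from the live logits
               (they must carry gradients for the bonus term to have effect)
    advantage: scalar group-relative advantage for this rollout
    """
    # Low-entropy (confident) tokens are excluded from the reward-driven
    # update to preserve stability: no policy-gradient signal flows there.
    update_mask = (entropies > tau_low).float()

    # Simplified REINFORCE-style surrogate in place of the full GRPO
    # clipped objective, applied only to tokens the mask keeps.
    pg_loss = -(advantage * logprobs * update_mask).sum()

    # High-entropy (uncertain) tokens additionally receive an entropy bonus,
    # encouraging structured exploration without entropy collapse.
    bonus_mask = (entropies > tau_high).float()
    entropy_bonus = (entropies * bonus_mask).sum()

    return pg_loss - beta * entropy_bonus
```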

Paper Structure

This paper contains 51 sections, 2 theorems, 22 equations, 7 figures, 6 tables.

Key Result

Proposition 1

For a batch $\mathcal{B}$, choose the calibration constant as specified in Appendix app:budget-proof. Then $\mathbb{E}_{\mathcal{B}}[B^{(i)}_{\mathrm{EG}}] \approx \kappa \cdot \mathbb{E}_{\mathcal{B}}[B^{(i)}_{\mathrm{GRPO}}]$, where $B^{(i)}_{\mathrm{EG}}$ and $B^{(i)}_{\mathrm{GRPO}}$ denote the per-sample optimization budgets under EG-GRPO and plain GRPO, respectively. Setting $\kappa=1$ yields batch-level budget neutrality in the calibrated upper-bound sense. A detailed derivation and calibration discussion are deferred to Appendix app:budget-proof.
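As a quick numerical sanity check of the budget-balance claim (a toy sketch only: the entropy distribution, the quantile-based threshold calibration, and the token-count notion of budget below are assumptions for illustration, not the paper's derivation), calibrating the gating threshold to the $(1-\kappa)$-quantile of the batch entropies keeps a $\kappa$-fraction of the plain-GRPO budget in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token entropies for a batch of rollouts.
batch, T = 256, 1024
H = rng.gamma(shape=2.0, scale=0.5, size=(batch, T))

kappa = 0.5  # target fraction of the plain-GRPO budget

# Plain GRPO updates every token: B_GRPO^(i) = T for each sample i.
B_grpo = np.full(batch, T)

# Entropy gating updates only tokens above a threshold tau; calibrating
# tau to the (1 - kappa)-quantile of the batch entropies keeps a kappa
# fraction of the tokens on average.
tau = np.quantile(H, 1.0 - kappa)
B_eg = (H > tau).sum(axis=1)

# Batch-level budget ratio: approximately kappa; with kappa = 1 the
# threshold keeps every token and the budgets coincide exactly.
print(B_eg.mean() / B_grpo.mean())
```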

Figures (7)

  • Figure 1: Comparison of different text-to-image generation methods: (a) plain autoregressive text-to-image generation, (b) generation with CoT, and (c) generation with CoT and GRPO optimization.
  • Figure 2: Entropy–reward distributions of different methods. CoT (Janus-Pro+CoT) expands the exploratory space with more diverse outputs, while GRPO fine-tuning (T2I-R1) contracts it toward higher-reward regions, yielding more stabilized, high-quality generations.
  • Figure 3: Left: Reward vs. CoT entropy (stable cases, Image Entropy Std $<$ 0.011). Higher CoT entropy correlates with lower image reward. Right: Reward distributions across different CoTs for the same prompt. Images from the same CoT cluster together, with certain CoTs consistently yielding lower rewards.
  • Figure 4: Left: Reward vs. entropy std. Higher instability (larger std) consistently lowers reward. Middle: Relation between entropy std (x-axis) and the negative correlation of reward–entropy mean (y-axis). Greater instability strengthens the negative correlation. Right: Reward vs. entropy mean under high-variance cases (std $>$ 0.03). Large std implies exploratory generation where RL has not yet converged; in this regime, reducing mean entropy is especially beneficial (see the entropy-statistics sketch after this list).
  • Figure 5: Entropy distributions of EG-GRPO vs. T2I-R1: left for textual CoT tokens, right for image tokens.
  • ...and 2 more figures
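
The entropy statistics underlying Figures 2–4 can be computed schematically as follows. This is a minimal sketch assuming access to the per-token logits of each generated image and a scalar reward per image; the function and variable names are illustrative, not from the paper.

```python
import torch

def token_entropy_stats(logits):
    """Per-image mean and std of token-level entropy.

    logits: (N, T, V) logits over the image-token vocabulary
            for N images of T tokens each.
    """
    logp = torch.log_softmax(logits, dim=-1)
    H = -(logp.exp() * logp).sum(dim=-1)   # (N, T) per-token entropies
    return H.mean(dim=-1), H.std(dim=-1)   # (N,), (N,)

def pearson(x, y):
    """Pearson correlation between two 1-D tensors."""
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)

if __name__ == "__main__":
    # Synthetic stand-ins for per-image logits and scalar rewards.
    logits = torch.randn(8, 16, 100)
    rewards = torch.randn(8)
    mean_H, std_H = token_entropy_stats(logits)
    print(pearson(mean_H, rewards).item(),  # reward vs. entropy mean
          pearson(std_H, rewards).item())   # reward vs. entropy std
```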

Theorems & Definitions (2)

  • Proposition 1: Per-batch budget balance
  • Corollary 5.1: Preserving GRPO stationary points