AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models

Yuming Li, Qingyu Li, Chengyu Bai, Xiangyang Luo, Zeyue Xue, Wenyu Qin, Meng Wang, Yikai Wang, Shanghang Zhang

TL;DR

The paper tackles inefficiencies in GRPO-based RLHF for diffusion and flow models that arise from treating all prompts and all denoising timesteps uniformly. It introduces attention entropy as a dual intrinsic proxy: the relative entropy change $\Delta$Entropy measures per-sample learning value, and the peaks of the absolute entropy $Entropy(t)$ mark the timesteps with the highest attention dispersion. Together these signals drive a dual-level adaptive framework, AEGPO. Global Adaptive Allocation uses $\Delta$Entropy to focus rollouts on high-value prompts, while Local Adaptive Exploration restricts exploration to the Top-K entropy-peak timesteps. Empirical results across multiple backbones and GRPO variants show up to 5× faster convergence and improved alignment metrics at modest overhead, which suggests the method generalizes across diffusion-model alignment setups. The approach offers a lightweight, intrinsically guided alternative to reward-based or fixed schemes, potentially accelerating RLHF deployment in real-world generative systems.

Abstract

Reinforcement learning from human feedback (RLHF) shows promise for aligning diffusion and flow models, yet policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. These methods treat all prompts and denoising steps uniformly, ignoring substantial variations in sample learning value as well as the dynamic nature of critical exploration moments. To address this issue, we conduct a detailed analysis of the internal attention dynamics during GRPO training and uncover a key insight: attention entropy can serve as a powerful dual-signal proxy. First, across different samples, the relative change in attention entropy (ΔEntropy), which reflects the divergence between the current policy and the base policy, acts as a robust indicator of sample learning value. Second, during the denoising process, the peaks of absolute attention entropy (Entropy(t)), which quantify attention dispersion, effectively identify critical timesteps where high-value exploration occurs. Building on this observation, we propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy. At the global level, AEGPO uses ΔEntropy to dynamically allocate rollout budgets, prioritizing prompts with higher learning value. At the local level, it exploits the peaks of Entropy(t) to guide exploration selectively at critical high-dispersion timesteps rather than uniformly across all denoising steps. By focusing computation on the most informative samples and the most critical moments, AEGPO enables more efficient and effective policy optimization. Experiments on text-to-image generation tasks demonstrate that AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.
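
As a rough illustration of the two signals named in the abstract, the sketch below computes an absolute attention entropy for one denoising step and a relative entropy change between two policies. The abstract does not give the exact formulas, so everything here is an assumption: entropy is taken as the Shannon entropy of each attention row averaged over queries, ΔEntropy is taken as the mean absolute relative difference between the current and base entropy trajectories, and the names `attention_entropy` and `delta_entropy` are hypothetical.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy of attention rows (assumed form of Entropy(t)).

    attn: array of shape (..., queries, keys) whose rows sum to 1.
    Higher values mean attention is spread over more keys.
    """
    attn = np.asarray(attn, dtype=np.float64)
    # eps keeps log() finite for exact zeros; 0 * log(eps) contributes 0.
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())


def delta_entropy(current_traj, base_traj, eps=1e-12):
    """Relative entropy change between two policies (assumed form of ΔEntropy).

    Both arguments hold one Entropy(t) value per denoising step for the
    same prompt, computed under the current and the base policy.
    """
    current = np.asarray(current_traj, dtype=np.float64)
    base = np.asarray(base_traj, dtype=np.float64)
    return float(np.mean(np.abs(current - base) / (np.abs(base) + eps)))


# Toy attention maps: 4 queries attending over 8 keys.
uniform = np.full((4, 8), 1.0 / 8)  # maximally dispersed attention
peaked = np.eye(4, 8)               # each query attends to a single key

print(attention_entropy(uniform))   # close to log(8)
print(attention_entropy(peaked))    # close to 0
print(delta_entropy([1.5, 3.0], [1.0, 2.0]))
```

Under these assumptions, a prompt whose entropy trajectory barely moves away from the base policy gets a ΔEntropy near zero, while a prompt whose attention behavior has shifted substantially gets a larger value.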

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: AEGPO significantly accelerates policy optimization. Compared to standard GRPO variants, AEGPO achieves 2$\times$ faster convergence on DanceGRPO (left) and 5$\times$ faster convergence on DiffusionNFT (right), while also reaching a superior final reward.
  • Figure 2: Illustration of varied Attention Entropy dynamics during GRPO training. (Left Panels): Generated images evolving across training steps. The top prompt shows minor visual changes, while the bottom prompt undergoes significant improvement. (Right Panels): The corresponding Absolute Attention Entropy ($Entropy(t)$) trajectories over the denoising steps $t$. The entropy curves for the top prompt remain clustered, indicating a low overall Relative Entropy Change ($\Delta$Entropy). Conversely, those for the bottom prompt show substantial divergence (indicating a high overall $\Delta$Entropy), which correlates with the degree of visual change. We provide additional qualitative visualizations, as well as a detailed analysis of the impact of different reward models on entropy dynamics, in the Appendix.
  • Figure 3: Validation of relative Attention Entropy change ($\Delta$Entropy) as a robust proxy for sample learning value. (Left): In early training, both $\Delta$Reward and $\Delta$Entropy rise, indicating active policy improvement accompanied by large adjustments in attention behavior. In later stages, $\Delta$Reward plateaus and $\Delta$Entropy correspondingly stabilizes or slightly decreases, reflecting that the model has reached a confident and stable attention configuration and no longer requires large policy deviations to obtain reward gains. (Right): Reward convergence comparison. The red line shows a model trained only on high-$\Delta$Entropy data, while the blue line shows a model trained only on low-$\Delta$Entropy data. Training on high-value samples leads to significantly faster convergence and a superior final reward, confirming their greater learning value.
  • Figure 4: Distribution of Top-K Absolute Attention Entropy ($Entropy(t)$) peaks across denoising steps $t$. The y-axis shows the probability that a given step $t$ contains one of the Top-K highest entropy peaks. The distribution is distinctly U-shaped, with high-dispersion peaks clustering in the very early (e.g., $t\approx 1$) and late (e.g., $t\approx 13\text{--}15$) stages of the denoising process.
  • Figure 5: Overview of the AEGPO framework, illustrating our dual-level adaptive strategy. (Top) The central AEGPO Adaptive Module (pink box) receives intrinsic signals derived from the model's policies and uses them to guide the GRPO rollout and computation process. (Bottom Left) Global Adaptive Allocation: The per-sample $\Delta$Entropy (relative entropy change) is used as a proxy for sample value. High-value prompts are dynamically allocated a larger rollout budget (Rollout More), while low-value prompts receive a smaller one (Rollout Less). (Bottom Right) Local Adaptive Exploration: The per-timestep $Entropy(t)$ (absolute entropy) is used as a proxy for model dispersion. Exploration is dynamically triggered only at high-dispersion timesteps (orange circles), focusing exploration on the most critical moments of the denoising process.
  • ...and 2 more figures
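
The two levels described in the Figure 5 caption can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the paper's implementation: the captions do not specify the allocation rule, so the sketch assumes rollouts are split in proportion to non-negative ΔEntropy scores with a guaranteed minimum per prompt and largest-remainder rounding, and it assumes exploration timesteps are simply the K steps with the highest $Entropy(t)$. The function names and parameters (`allocate_rollouts`, `topk_timesteps`, `min_per_prompt`) are hypothetical.

```python
import numpy as np


def allocate_rollouts(delta_entropies, total_budget, min_per_prompt=1):
    """Split a rollout budget across prompts in proportion to ΔEntropy.

    Assumed rule: every prompt gets min_per_prompt rollouts, and the rest
    is shared proportionally to the (non-negative) scores.
    """
    scores = np.asarray(delta_entropies, dtype=np.float64)
    n = scores.size
    spare = total_budget - n * min_per_prompt
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_prompt")
    total_score = scores.sum()
    # Fall back to a uniform split when all scores are zero.
    weights = scores / total_score if total_score > 0 else np.full(n, 1.0 / n)
    ideal = weights * spare
    extra = np.floor(ideal).astype(int)
    # Largest-remainder rounding so the budgets sum to total_budget exactly.
    leftover = spare - int(extra.sum())
    order = np.argsort(-(ideal - extra), kind="stable")
    extra[order[:leftover]] += 1
    return (extra + min_per_prompt).tolist()


def topk_timesteps(entropy_traj, k):
    """Indices of the k denoising steps with the highest Entropy(t)."""
    traj = np.asarray(entropy_traj, dtype=np.float64)
    idx = np.argsort(-traj, kind="stable")[:k]
    return sorted(idx.tolist())


# Three prompts with increasing ΔEntropy share 13 rollouts.
budgets = allocate_rollouts([0.1, 0.4, 0.5], total_budget=13)
print(budgets)

# A U-shaped toy entropy trajectory over six denoising steps.
explore_at = topk_timesteps([3.0, 2.0, 1.0, 1.0, 2.5, 2.9], k=2)
print(explore_at)
```

With a U-shaped trajectory like the toy one above, the selected steps fall at the start and end of the denoising process, which matches the pattern reported in the Figure 4 caption.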