
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie

Abstract

Reinforcement-learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered two-stage reinforcement learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of the BF16-precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
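The decoupled scheme described above can be sketched as a single rollout step: a large candidate pool is explored under cheap FP4 sampling, the pool is ranked by reward, and only the most contrastive seeds are regenerated in BF16 for training. This is a minimal, self-contained illustration of the control flow only; `fp4_rollout`, `bf16_rollout`, and `reward` are hypothetical stand-ins for the actual NVFP4 sampler, BF16 sampler, and reward model (e.g., ImageReward), not the paper's implementation.

```python
import random

def fp4_rollout(prompt, n):
    # Hypothetical stand-in for the high-throughput NVFP4 sampler:
    # cheaply produces a large pool of n candidate samples.
    return [f"{prompt}-fp4-{i}" for i in range(n)]

def bf16_rollout(prompt, seeds):
    # Hypothetical stand-in for BF16 regeneration: only the chosen
    # seeds are re-rendered at full training fidelity.
    return [f"{prompt}-bf16-{s}" for s in seeds]

def reward(sample):
    # Hypothetical reward model; a random score here.
    return random.random()

def sol_rl_step(prompt, n_explore=16, k_train=4):
    """One decoupled rollout step, sketching the two-stage scheme:
    Stage 1 explores n_explore candidates in FP4 and ranks them;
    Stage 2 regenerates a contrastive k_train subset in BF16."""
    # Stage 1: FP4 exploration, then sort candidate indices by reward.
    pool = fp4_rollout(prompt, n_explore)
    ranked = sorted(range(n_explore), key=lambda i: reward(pool[i]))
    # Keep both extremes of the ranking: the most contrastive seeds.
    chosen = ranked[: k_train // 2] + ranked[-(k_train - k_train // 2):]
    # Stage 2: high-fidelity BF16 regeneration of the chosen seeds only;
    # the policy would be optimized exclusively on these samples.
    return bf16_rollout(prompt, chosen)

samples = sol_rl_step("a red bicycle")  # 4 BF16 samples from a pool of 16
```

Only the selection logic is load-bearing here: the FP4 pool is used solely to decide *which* seeds are worth the cost of BF16 regeneration, which is what decouples exploration throughput from training fidelity.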

Paper Structure

This paper contains 31 sections, 11 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Sol-RL enables efficient and high-fidelity text-to-image alignment. (Left) High-quality images generated by FLUX.1 and SANA fine-tuned with our method, demonstrating superior generation capabilities across diverse styles. (Right) ImageReward training curves. They demonstrate that Sol-RL achieves substantial wall-clock speedups (up to $\mathbf{4.64\times}$) to reach an equivalent reward level, ultimately converging to a higher alignment ceiling.
  • Figure 2: Decoupled two-stage reinforcement learning pipeline of Sol-RL. We separate the high-throughput FP4 exploration from the selective BF16 high-contrastive rollout. This framework achieves up to $2.4\times$ acceleration compared to naive scaling while avoiding quantization-induced corruption, introducing merely a 2% computational overhead.
  • Figure 3: Pitfalls and Potential of NVFP4 rollouts. (a) Time breakdown of high-precision rollout scaling and direct quantized rollout. The x-axis labels follow the format $K$-in-$N$ ($P$), denoting that $K$ samples are selected for training from $N$ generated rollouts under $P$ precision. (b) Directly integrating FP4 rollout in RL pipeline leads to severe instability and performance degradation compared to the BF16 baseline. (c) Conversely, the dense diagonal distribution of intra-group relative reward rankings validates NVFP4 quantized rollouts as a reliable proxy for reward sorting.
  • Figure 4: Comparison across diverse foundation models and alignment metrics. Evaluated under identical wall-clock budgets (GPU Hours), Sol-RL (green) consistently outperforms the DiffusionNFT baseline (grey). Across all tested combinations of models and reward functions, our decoupled scaling strategy accelerates convergence to the baseline's equivalent performance by up to $4.64\times$, ultimately converging to a remarkably higher final alignment ceiling.
  • Figure 5: Visual comparison before and after Sol-RL. Compared to the SANA base model without fine-tuning (top row), the counterpart optimized across multiple rewards (HPSv2, PickScore, CLIPScore and OCR) via Sol-RL (bottom row) exhibits substantial improvements in complex detail rendering and semantic alignment across various prompts.
  • ...and 4 more figures
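Figure 3(c) above argues that NVFP4 rollouts preserve intra-group relative reward rankings, making them a reliable proxy for reward sorting. That kind of agreement is naturally quantified by a Spearman rank correlation between per-group BF16 and FP4 reward scores. The sketch below uses simulated rewards (the noise model and group size are illustrative assumptions, not the paper's data) and implements Spearman's rho from scratch without ties:

```python
import random

def rank(xs):
    # Ordinal ranks of xs (assumes no ties, true for random floats).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def spearman(a, b):
    # Spearman's rho for tie-free data: 1 - 6*sum(d^2) / (n*(n^2-1)).
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Simulated intra-group rewards: FP4 scores modeled as BF16 scores
# plus a small quantization perturbation (illustrative assumption).
random.seed(0)
bf16_rewards = [random.random() for _ in range(16)]
fp4_rewards = [r + random.gauss(0, 0.02) for r in bf16_rewards]

rho = spearman(bf16_rewards, fp4_rewards)
# rho near 1 means FP4 preserves the reward ordering within the group,
# so FP4 scores suffice for the K-in-N contrastive selection.
```

Under this view, FP4 rollouts only need to get the *ordering* right, not the absolute reward values, which is a much weaker requirement than numerical fidelity.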