Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models
Benjamin Yu, Jackie Liu, Justin Cui
TL;DR
This work tackles reinforcement learning for flow-matching models, which are inherently deterministic and thus ill-suited for policy optimization. It introduces Smart-GRPO, a reward-guided noise-search framework that parameterizes the input noise with a Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ and iteratively refines $\mu$ and $\Sigma$ over $N$ rounds. Each round samples $K$ noises $z_i$, computes a one-step decoded image via $x_0^{(i)} \approx z_i - t\, v_{\theta}(z_i,t)$, scores each image with a pretrained reward model $f$, and refits the distribution to the top-$T$ noises in the style of the cross-entropy method. Empirical results show that Smart-GRPO improves both reward metrics (e.g., ImageReward) and visual quality over Flow-GRPO and baseline diffusion models, with faster convergence and greater stability, while requiring no architectural changes. The method demonstrates a practical, noise-aware pathway to RL in flow-based generative pipelines and suggests broader applicability to noise optimization in reinforcement learning for generative models. Future work includes designing more robust reward signals, exploring alternative noise-update strategies, and scaling to larger models and benchmarks.
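The search loop summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the covariance is taken to be diagonal, the refit uses the plain elite mean and standard deviation as in the standard cross-entropy method, and `smart_noise_search`, `toy_velocity`, and `toy_reward` are hypothetical names standing in for the real velocity network $v_{\theta}$ and reward model $f$.

```python
import numpy as np


def smart_noise_search(velocity_fn, reward_fn, dim, t=1.0, K=16, T=4, N=5, seed=0):
    """Cross-entropy-style search over a Gaussian noise distribution.

    Assumption: diagonal covariance, stored as a per-dimension std `sigma`.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    sigma = np.ones(dim)
    for _ in range(N):
        # Sample K candidate noises from the current distribution.
        z = mu + sigma * rng.standard_normal((K, dim))
        # One-step decode: x0 ~ z - t * v(z, t).
        x0 = z - t * velocity_fn(z, t)
        # Score every decoded candidate with the reward function.
        rewards = reward_fn(x0)
        # Keep the T highest-reward noises and refit the Gaussian to them.
        elite = z[np.argsort(rewards)[-T:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6  # small floor keeps sigma positive
    return mu, sigma


# Toy stand-ins (assumptions): a linear "velocity" and a quadratic reward.
target = np.ones(4)


def toy_velocity(z, t):
    return 0.5 * z


def toy_reward(x0):
    return -np.sum((x0 - target) ** 2, axis=-1)


mu, sigma = smart_noise_search(toy_velocity, toy_reward, dim=4, K=64, T=8, N=10)
print("refined mean:", mu)
print("refined std:", sigma)
```

In this toy setup the decoded sample is $x_0 = 0.5\,z$ at $t = 1$, so the search should move $\mu$ away from the origin and toward noises whose decoded output lies closer to `target`. In a real pipeline the noises returned by the search would then feed the GRPO update, which this sketch does not cover.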
Abstract
Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.
