Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models

Benjamin Yu, Jackie Liu, Justin Cui

TL;DR

This work tackles reinforcement learning for flow-matching models, which are inherently deterministic and thus ill-suited for policy optimization. It introduces Smart-GRPO, a reward-guided noise-search framework that parameterizes the input noise with a Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$ and iteratively refines $\mu$ and $\sigma$. Each iteration samples $K$ noises $z_i$, computes a one-step decoded image via $x_0^{(i)} \approx z_i - t\, v_{\theta}(z_i,t)$, evaluates rewards with a pretrained model $f$, and updates the distribution from the top-$T$ noises in a Cross-Entropy-like fashion; this is repeated for $N$ rounds. Empirical results show that Smart-GRPO improves both reward metrics (e.g., ImageReward) and visual quality over Flow-GRPO and baseline diffusion models, with faster convergence and greater stability, while requiring no architectural changes. The method demonstrates a practical, noise-aware pathway to RL in flow-based generative pipelines and suggests broader applicability to noise optimization in reinforcement learning for generative models. Future work includes designing more robust reward signals, exploring alternative noise-update strategies, and scaling to larger models and benchmarks.
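The search loop summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_velocity` and `toy_reward` are hypothetical stand-ins for the flow model $v_{\theta}$ and the reward model $f$, the function and parameter names are invented for this example, and the diagonal mean/standard-deviation update is the standard Cross-Entropy Method rule, which may differ from the paper's exact update.

```python
import numpy as np

def one_step_decode(z, t, velocity):
    """One-step estimate x0 ~ z - t * v(z, t), as quoted in the TL;DR."""
    return z - t * velocity(z, t)

def smart_noise_search(velocity, reward, dim, t=0.6,
                       num_samples=64, top_t=8, num_rounds=10, seed=0):
    """Cross-Entropy-style search over a diagonal Gaussian N(mu, sigma^2).

    num_samples plays the role of K, top_t of T, num_rounds of N.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(num_rounds):
        # Sample K candidate noises from the current distribution.
        z = mu + sigma * rng.standard_normal((num_samples, dim))
        # Decode each candidate in one step and score it.
        x0 = one_step_decode(z, t, velocity)
        scores = np.array([reward(x) for x in x0])
        # Keep the top-T candidates and refit the Gaussian to them.
        elite = z[np.argsort(scores)[-top_t:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6  # small floor avoids a collapsed sigma
    return mu, sigma

# Hypothetical stand-ins, for illustration only.
def toy_velocity(z, t):
    return 0.5 * z  # a made-up linear velocity field

def toy_reward(x0):
    return -float(np.sum((x0 - 1.0) ** 2))  # higher when x0 is near all-ones

mu, sigma = smart_noise_search(toy_velocity, toy_reward, dim=2)
```

In this toy setting the returned `mu` plays the role of the optimized noise that would then be handed to GRPO training; with a real model, `velocity` would wrap a forward pass of the flow network and `reward` would wrap image decoding plus a reward model such as ImageReward.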

Abstract

Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.

Paper Structure

This paper contains 14 sections, 6 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Overview of Smart-GRPO. The method begins by initializing a Gaussian noise distribution parameterized by $(\mu, \sigma)$. At each iteration, candidate noise samples are drawn, applied to perturb the latent representation, denoised for one step, and decoded into images. A reward model evaluates the resulting images, and the top-$k$ noise samples are used to update the distribution parameters. After K iterations, the final mean $\mu$ is selected as the optimized noise for GRPO training.
  • Figure 2: Training results of Smart-GRPO over 360 epochs. Panel (a) is trained with ImageReward, and panel (b) is trained with the Aesthetic score.
  • Figure 3: Figure for ablation studies
  • Figure 4: Sensitivity analysis for the number of iterations used in Smart-GRPO. With 1 iteration, performance is unstable and fluctuates. As the number of iterations increases, performance improves because additional iterations refine the distribution parameters more reliably.
  • Figure 5: Intermediate approximations from the one-step approximation equation, $x_0^{(i)} \approx z_i - t\, v_{\theta}(z_i,t)$, during flow-matching generation of the prompt ‘A steaming cup of coffee’. Starting from a noise level of 0.6 and decoded over 10 steps, earlier timesteps yield outputs resembling noise, while later timesteps progressively form low-quality images.
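
As a brief aside on where the one-step approximation $x_0 \approx z - t\, v_{\theta}(z,t)$ quoted in the TL;DR may come from, the following worked derivation assumes the common rectified-flow convention, in which a noisy latent is a linear interpolation between a clean sample $x_0$ and Gaussian noise $\epsilon$. That convention is an assumption here; the summary does not state the paper's exact parameterization.

```latex
\begin{align*}
x_t &= (1-t)\,x_0 + t\,\epsilon, \qquad v = \epsilon - x_0 \\
x_t - t\,v &= (1-t)\,x_0 + t\,\epsilon - t\,\epsilon + t\,x_0 = x_0
\end{align*}
```

Under that assumption the identity is exact for the true per-sample velocity $v$. A learned $v_{\theta}$ only approximates it, so the one-step estimate is approximate as well, which is consistent with the rough intermediate images described in the Figure 5 caption.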