Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Feng Wang, Zihao Yu

TL;DR

This work identifies a fundamental flaw in SDE-based sampling for Flow Matching in reinforcement learning, where injected noise distorts reward signals and hampers learning. It introduces Coefficients-Preserving Sampling (CPS), a DDIM-inspired framework that preserves the correct coefficient balance between the sample and noise at every step, yielding noise-free samples even under high stochasticity. The proposed Flow-CPS method improves reward estimation and convergence speed across several GRPO-based tasks (GenEval, PickScore, HPSv2, OCR) compared to Flow-GRPO and Dance-GRPO baselines, with practical implications for faster, more reliable diffusion-model RL. The authors provide theoretical analysis linking Flow-SDE to a Taylor-approximation error and demonstrate a scalable, implementable approach with public code forthcoming.

Abstract

Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step in applying online RL methods to Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized through a Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS

Paper Structure

This paper contains 22 sections, 1 theorem, 27 equations, 12 figures, 5 tables.

Key Result

Theorem 1

Flow-SDE is a first-order Taylor approximation of Flow-CPS in the limit of $\sigma_t \sqrt{\Delta t} \ll t-\Delta t$ and $\Delta t \to 0$, with a noise level error of $\sqrt{\frac{(\sigma_t \Delta t)^2}{t}+ \left(\frac{\sigma_t^2 \Delta t}{2t}\right)^2}$.
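
To give a feel for the size of this error term, the following minimal sketch evaluates the expression from Theorem 1 for a few step counts. The function name `sde_noise_level_error`, the constant `sigma_t = 0.7`, the noise level `t = 0.5`, and the uniform step size `dt = 1 / steps` are illustrative assumptions and are not taken from the paper; in practice $\sigma_t$ usually depends on $t$ and the time schedule may be non-uniform.

```python
import math


def sde_noise_level_error(sigma_t: float, dt: float, t: float) -> float:
    """Evaluate the noise-level error expression stated in Theorem 1.

    sqrt( (sigma_t * dt)^2 / t + (sigma_t^2 * dt / (2 t))^2 )
    """
    if t <= 0.0:
        raise ValueError("t must be positive (the expression is singular at t = 0)")
    first_term = (sigma_t * dt) ** 2 / t
    second_term = (sigma_t ** 2 * dt / (2.0 * t)) ** 2
    return math.sqrt(first_term + second_term)


# Hypothetical settings, chosen only for illustration.
sigma_t = 0.7  # assumed constant noise parameter
t = 0.5        # assumed current noise level

# Assumed uniform schedule: dt = 1 / steps.
errors = {steps: sde_noise_level_error(sigma_t, 1.0 / steps, t)
          for steps in (1000, 16, 4)}

for steps, err in errors.items():
    print(f"steps={steps:4d}  error={err:.6f}")
```

Under these assumptions the error grows as the number of sampling steps shrinks, which is consistent with the trend described for Figure 2.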

Figures (12)

  • Figure 1: The images sampled by Flow-SDE exhibit severe noise, and the noise magnitude increases with higher sampling noise parameters. In contrast, our Coefficients-Preserving Sampling (CPS) algorithm produces noise-free images regardless of the noise level. Notably, these images will be fed into a reward model, and the noisy images may lead to inaccurate rewards.
  • Figure 2: The ideal noise level $t$ and SDE noise level (Equation \ref{eq:total_noise_error}) for Flow-GRPO and Dance-GRPO with $1000$, $16$, and $4$ sampling steps. Except for the numerical problem around $t=0$ and $t=1$, the error of the noise level increases as the number of sampling steps decreases.
  • Figure 3: (a): DDIM deterministic sampling process. Note that $\epsilon$ is a random Gaussian noise, which is almost orthogonal to the sample $\bm{x}_0$. Since $\sqrt{\alpha_t}^2+\sqrt{\beta_t}^2=1$, the trajectory is part of a quarter-circle at each step. (b): DDIM sampling process with stochasticity (Equation \ref{eq:sample-eq-gen}). $\epsilon_t$ is also a random Gaussian noise, which is almost orthogonal to $\bm{x}_0$ and $\epsilon_\theta$. (c): Flow matching ODE sampler. The trajectory is a straight line at each step. (d): Our proposed Coefficients-Preserving Sampling (Equation \ref{eq:simple_flow_eta}). These figures are from the blog post wang2024zhihu.
  • Figure 4: Left: PickScore optimization based on FLUX.1-dev. The sampling step number is $6$ for training and $28$ for evaluation. Right: PickScore optimization based on FLUX.1-schnell. The sampling step number is $4$ for both training and evaluation. Note that there is no stochasticity during evaluation, so the rewards of the two sampling methods are the same at the beginning. For all experiments, we set $\eta=0.9$.
  • Figure 5: Left: GenEval optimization based on SD3.5. The sampling step number is $10$ for training and $40$ for evaluation. It is crucial to note that the exclusion of the KL loss resulted in significant performance degradation or model collapse for both sampling methods. We set $\eta=0.7$ in these experiments. Right: HPSv2 optimization based on FLUX.1-dev. Since the codebase of Dance-GRPO does not provide online evaluation, we show the moving average of the training curves and leave the final evaluation performance in Table \ref{tab:hps}. We set $\eta=0.7$ for our method and $\eta=0.3$ (default value) for Dance-GRPO.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Definition 1: Coefficients-Preserving Sampling
  • Theorem 1