Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Feng Wang; Zihao Yu

TL;DR

This work identifies a fundamental flaw in SDE-based sampling for Flow Matching in reinforcement learning: the injected noise distorts reward signals and hampers learning. It introduces Coefficients-Preserving Sampling (CPS), a DDIM-inspired framework that preserves the correct coefficient balance between the sample and the noise at every step, yielding noise-free samples even under high stochasticity. Compared with the Flow-GRPO and Dance-GRPO baselines, the proposed Flow-CPS method improves reward estimation and convergence speed across several GRPO-based tasks (GenEval, PickScore, HPSv2, OCR). The authors also provide a theoretical analysis that links the Flow-SDE noise to a Taylor-approximation error, and they state that code will be released publicly.

Abstract

Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step in applying online RL methods to Flow Matching is introducing stochasticity into the deterministic framework, which is commonly done with a Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback of this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we find to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
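The coefficient-preserving idea can be illustrated with a minimal sketch. It is not taken from the text above and may differ from the authors' released code. It assumes the common rectified-flow interpolation x_t = (1 - t) * x0 + t * x1 with x1 ~ N(0, I) and velocity v = x1 - x0, and it assumes a cosine/sine split of the noise coefficient controlled by a hypothetical stochasticity parameter `eta`. The function name `cps_style_step` is likewise hypothetical. The point of the sketch is that the squared coefficients on the predicted noise and the fresh noise always sum to `t_next**2`, so the sample lands on the noise level the schedule expects regardless of how much fresh noise is injected.

```python
import numpy as np


def cps_style_step(x_t, v_pred, t, t_next, eta, rng):
    """One DDIM-inspired stochastic step for a rectified-flow model.

    Illustrative only. Assumes x_t = (1 - t) * x0 + t * x1 and v = x1 - x0.
    `eta` in [0, 1] is an assumed knob: 0 is deterministic, 1 replaces the
    predicted noise entirely with fresh noise.
    """
    # Recover the model's estimates of the clean sample and of the noise.
    x0_hat = x_t - t * v_pred
    x1_hat = x_t + (1.0 - t) * v_pred

    # Split the noise budget so that keep**2 + inject**2 == 1.
    keep = np.cos(eta * np.pi / 2.0)
    inject = np.sin(eta * np.pi / 2.0)
    fresh = rng.standard_normal(x_t.shape)

    # The total noise standard deviation stays t_next for every eta.
    return (1.0 - t_next) * x0_hat + t_next * (keep * x1_hat + inject * fresh)
```

With `eta = 0` the step reduces to a deterministic move along the interpolation path. Larger `eta` adds exploration for RL rollouts without raising the overall noise level of the sample.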
