Flow Matching Policy with Entropy Regularization

Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn

Abstract

Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control because the exact entropy is intractable, and they also suffer from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
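The straight probability path mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common flow-matching convention in which noise sits at t = 0 and the action at t = 1, so that the path is x_t = (1 - t) x_0 + t x_1 with constant conditional velocity x_1 - x_0. The names `straight_path_point`, `conditional_velocity`, and `euler_sample` are hypothetical, and a fixed constant velocity stands in for a learned, state-conditioned network.

```python
import random

def straight_path_point(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1 (assumed convention)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def conditional_velocity(x0, x1):
    """Velocity of the straight path: d x_t / dt = x1 - x0, constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy example: a 2D "action" reached from a Gaussian noise sample.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
target_action = [0.5, -0.25]

# Stand-in for a learned velocity network: the exact conditional velocity.
v_const = conditional_velocity(noise, target_action)
action = euler_sample(lambda x, t: v_const, noise, n_steps=10)
```

Because the velocity is constant along a straight path, Euler integration lands on the target up to floating-point error regardless of the number of steps. A learned velocity field is only approximately straight, so the step count would matter in practice.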

Paper Structure (18 sections, 19 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Overview of the FMER framework. Left (Policy Training): The policy generates a set of candidate actions, which are evaluated by the critic (Q-network) to compute advantage-based weights. These weights guide the Weighted Conditional FM loss, while an entropy regularization term ensures exploration. Right (Interaction): During data collection, the agent samples multiple candidates and executes the highest-value action to store in the replay buffer.
  • Figure 2: Weighted vs. greedy policy updates. Contours show the ground-truth Q-value landscape for a fixed state. From 16 sampled candidate actions, the top-1 (greedy) update moves toward the highest-valued sample (red arrow), leading to a local maximum. In contrast, the weighted update (green) aggregates all candidates via Q-based weights (small arrows) and moves toward the global optimum.
  • Figure 3: Policy evolution on the 2D multi-goal task for the same state $s=(0,0)$. Background contours depict the learned critic $Q(s,a)$ over the action space. For SAC, points show policy samples; for the generative policies, red dots show the random noise samples and their sampling trajectories toward the final actions.
  • Figure 4: State-conditioned policy visualization on the 2D multi-goal task. Background colors depict the ground-truth state-value landscape, where the four bright regions indicate the optimal goals. Red arrows show the policy-selected action at each state for SAC, DPMD, and FMER. The resulting vector field reveals the navigation strategy of each method.
  • Figure 5: Agents working in the default FrankaKitchen environment with a budget of 280 steps. Top row (a-e): FMER is able to finish 5 tasks. Bottom row (f-h): The best baseline model (DIPO) is able to finish 3 tasks. Task descriptions can be found in Appendix \ref{sec:appendix_exp}.
  • ...and 1 more figures
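
The candidate-set mechanism described in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the weights are taken to be a softmax over mean-centered Q-values with a temperature, which is one plausible form of "advantage-based weights" (the exact weighting is not given in this excerpt), and the loss is a per-sample weighted squared error against straight-path target velocities. All function names and numbers are hypothetical.

```python
import math

def advantage_weights(q_values, temperature=1.0):
    """Softmax over advantages (Q minus mean Q). Assumed form, not from the paper."""
    baseline = sum(q_values) / len(q_values)
    adv = [q - baseline for q in q_values]
    m = max(adv)  # subtract the max for numerical stability
    exps = [math.exp((a - m) / temperature) for a in adv]
    z = sum(exps)
    return [e / z for e in exps]

def greedy_index(q_values):
    """Index of the highest-valued candidate (the top-1 choice)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

def weighted_cfm_loss(pred_v, x0, candidates, weights):
    """Sum_i w_i * ||pred_v - (a_i - x0)||^2 for straight-path targets a_i - x0."""
    total = 0.0
    for w, a in zip(weights, candidates):
        target = [ai - xi for ai, xi in zip(a, x0)]
        total += w * sum((p - t) ** 2 for p, t in zip(pred_v, target))
    return total

# Toy 2D candidates with made-up critic values.
candidates = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_values = [0.1, 0.9, 0.8, 0.2]
x0 = [0.0, 0.0]  # noise sample at t = 0

weights = advantage_weights(q_values, temperature=0.5)
best = greedy_index(q_values)

# Greedy target: velocity toward the single best candidate.
greedy_v = [a - x for a, x in zip(candidates[best], x0)]
# Weighted target: weight-averaged velocity over all candidates.
weighted_v = [
    sum(w * (c[d] - x0[d]) for w, c in zip(weights, candidates))
    for d in range(len(x0))
]
```

In this toy setting the weighted velocity is the minimizer of the weighted squared loss, so it blends information from every candidate, whereas the greedy velocity points only at the top-1 sample. The interaction step in Figure 1 corresponds to executing `candidates[best]`.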