One-Step Flow Policy Mirror Descent

Tianyi Chen, Haitong Ma, Na Li, Kai Wang, Bo Dai

TL;DR

Diffusion policies are expressive in online RL, but their iterative sampling makes real-time inference slow. Flow Policy Mirror Descent (FPMD) enables 1-step sampling by relating the variance of the policy distribution to the discretization error of single-step sampling in straight-interpolation flow matching models. It comes in two variants with tractable online losses: FPMD-R (rectified flow) and FPMD-M (MeanFlow). On MuJoCo and visual DeepMind Control Suite benchmarks, FPMD performs comparably to diffusion-policy baselines while cutting inference cost by orders of magnitude, and MeanFlow offers training-time efficiency without sacrificing much accuracy. Overall, FPMD shows that flow-based policies can remain expressive during training yet deliver real-time inference suitable for online RL and robotics applications.

Abstract

Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on rectified flow policy and MeanFlow policy, respectively. Extensive empirical evaluations on MuJoCo and visual DeepMind Control Suite benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring orders of magnitude less computational cost during inference.
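As a rough illustration of what 1-step sampling means here, the sketch below integrates a flow ODE from noise at t=0 to an action at t=1 with a configurable number of Euler steps (function evaluations). It is a toy under stated assumptions, not the paper's implementation: the action is a scalar, the target policy is a point mass at a hypothetical action `TARGET_ACTION`, and the names `euler_sample` and `point_mass_velocity` are made up for this example. For a straight interpolation a_t = (1 - t) a_0 + t a_1 with a fixed a_1, the velocity is constant along each trajectory, so one Euler step lands on the same action as many steps.

```python
from typing import Callable


def euler_sample(velocity: Callable[[float, float], float], a0: float, nfe: int) -> float:
    """Integrate da/dt = velocity(a, t) from t=0 to t=1 with `nfe` uniform Euler steps."""
    a, step = a0, 1.0 / nfe
    for k in range(nfe):
        t = k * step
        a = a + step * velocity(a, t)
    return a


TARGET_ACTION = 0.7  # hypothetical deterministic action, chosen only for illustration


def point_mass_velocity(a: float, t: float) -> float:
    # E[a_1 - a_0 | a_t = a] for a_t = (1 - t) * a_0 + t * a_1 when a_1 is
    # always TARGET_ACTION: solving for a_0 gives (TARGET_ACTION - a) / (1 - t).
    return (TARGET_ACTION - a) / (1.0 - t)


one_step = euler_sample(point_mass_velocity, a0=-1.3, nfe=1)
ten_step = euler_sample(point_mass_velocity, a0=-1.3, nfe=10)
print(one_step, ten_step)
```

Both calls end at the target action up to floating-point rounding, which is the zero-variance limit in which single-step sampling incurs no discretization error.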

Paper Structure

This paper contains 41 sections, 6 theorems, 27 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

[Proposition 3.3, hu2024adaflow] Define $p_t^*$ as the marginal distribution of the exact ODE $\mathrm{d}a_t = v(a_t, t|s)\,\mathrm{d}t$. Assume $a_t \sim p_t = p_t^*$, and let $p_{t+\epsilon_t}$ be the distribution of $a_{t+\epsilon_t}$ following $a_{t+\epsilon_t} = a_t + \epsilon_t v(a_t, t|s)$, where $\epsilon_t \in [0, 1$ ... where $\sigma^2(a_t, t|s) = \operatorname{var}(a_1 - a_0 | a_t, s)$, $p_{t+\epsilon_t}^*$ d...
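The update $a_{t+\epsilon_t} = a_t + \epsilon_t v(a_t, t|s)$ is an explicit Euler step. The toy sketch below is not the proposition itself. It only illustrates, under assumptions made for this example, how the error of one full-length step depends on the spread of the target. The assumptions are: a scalar action, $a_0 \sim N(0, 1)$, and $a_1 \sim N(\mu, s^2)$ drawn independently of $a_0$. In that Gaussian case the marginal velocity and the exact ODE endpoint $\mu + s\,a_0$ have closed forms, so a single Euler step can be compared against the exact flow. All function names are hypothetical.

```python
def gaussian_velocity(a: float, t: float, mu: float, s: float) -> float:
    """E[a_1 - a_0 | a_t = a] for a_t = (1 - t) * a_0 + t * a_1, with
    a_0 ~ N(0, 1) and a_1 ~ N(mu, s^2) independent (toy assumption)."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # Var(a_t)
    cov = t * s * s - (1.0 - t)             # Cov(a_1 - a_0, a_t)
    return mu + cov / var_t * (a - t * mu)


def euler_sample(a0: float, mu: float, s: float, nfe: int) -> float:
    """Apply `nfe` uniform Euler steps a <- a + eps * v(a, t) from t=0 to t=1."""
    a, eps = a0, 1.0 / nfe
    for k in range(nfe):
        a += eps * gaussian_velocity(a, k * eps, mu, s)
    return a


def exact_endpoint(a0: float, mu: float, s: float) -> float:
    # Closed-form solution of the exact ODE for this Gaussian toy:
    # a_t = t * mu + sqrt(Var(a_t)) * a_0, evaluated at t = 1.
    return mu + s * a0


MU, A0 = 0.7, -1.3  # illustrative values only
for s in (1.0, 0.5, 0.1):
    exact = exact_endpoint(A0, MU, s)
    one_step_error = abs(euler_sample(A0, MU, s, nfe=1) - exact)
    fine_error = abs(euler_sample(A0, MU, s, nfe=1000) - exact)
    print(f"s={s}: one-step error={one_step_error:.4f}, 1000-step error={fine_error:.4f}")
```

In this toy the single step always lands on the mean $\mu$, so its error equals $s\,|a_0|$ and vanishes as the target standard deviation $s$ goes to zero, while the fine discretization tracks the exact endpoint for every $s$.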

Figures (6)

  • Figure 1: Policy inference time comparison between FPMD, Gaussian policy method SAC, and diffusion policy method DPMD.
  • Figure 2: Policy inference time comparison between FPMD, Gaussian policy method DDPG, and diffusion policy method DPMD.
  • Figure 3: Performance curves on visual continuous control tasks. FPMD outperforms all baselines with NFE=1 sampling.
  • Figure 4: Sampling trajectories of FPMD-R and DPMD policy after 5K and 200K training iterations. From left to right: FPMD-R policy trained for 5K iterations, FPMD-R policy trained for 200K iterations, DPMD policy trained for 5K iterations, DPMD policy trained for 200K iterations.
  • Figure 5: Sampling trajectories of FPMD-R on Ant-v4. At the beginning of training, the sampling trajectories are highly curved with non-uniform spacing between points of adjacent timesteps. As training proceeds, the trajectories become nearly straight with uniform spacing between points.
  • ...and 1 more figure

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Theorem 5: MeanFlow Policy Mirror Descent
  • Theorem: MeanFlow Policy Mirror Descent