ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
Tonghe Zhang, Chao Yu, Sichang Su, Yu Wang
TL;DR
ReinFlow introduces an online reinforcement learning framework to fine-tune pre-trained flow-matching policies for continuous robotic control by injecting learnable noise into the flow trajectory, turning it into a discrete-time Markov process with exact log-likelihoods even at few denoising steps. The approach enables principled policy gradient optimization (PPO) with a compact noise-injection network that balances exploration and exploitation, and supports various flow variants such as Rectified Flow and Shortcut Models. Empirical results across locomotion and manipulation tasks show substantial improvements in reward and success rates with reduced wall-clock time versus diffusion-based baselines, alongside systematic analyses of design choices and regularizations. The work demonstrates strong potential for practical, fast online adaptation of flow-based controllers in challenging, sparse-reward scenarios, while outlining future directions for sample efficiency and real-world scaling.
Abstract
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov process whose likelihood can be computed exactly and straightforwardly. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow on representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. In challenging legged locomotion tasks, the episode reward of Rectified Flow policies achieved an average net growth of 135.36% after fine-tuning, while using fewer denoising steps and saving 82.63% of wall-clock time compared to the state-of-the-art diffusion RL fine-tuning method DPPO [43]. In state-based and visual manipulation tasks, the success rate of Shortcut Model policies achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, performing comparably to fine-tuned DDIM policies while saving an average of 23.20% of computation time. Project webpage: https://reinflow.github.io/
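To make the noise-injection idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an Euler discretization of the flow with K uniform steps, a standard Gaussian initial sample, and diagonal Gaussian noise at each step, so that every transition is Gaussian and the log-likelihood of the whole chain is a sum of closed-form terms. The names `toy_velocity`, `toy_noise_std`, and `noisy_flow_rollout` are hypothetical stand-ins for the pre-trained velocity network, the learnable noise-injection network, and the sampling loop.

```python
import numpy as np


def gaussian_log_prob(x, mean, std):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    var = std ** 2
    return np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var)), axis=-1)


def toy_velocity(t, a, obs):
    # Hypothetical stand-in for the pre-trained flow velocity network.
    return obs - a


def toy_noise_std(t, a, obs):
    # Hypothetical stand-in for the learnable noise-injection network.
    # A constant is used here; in practice this would be a small trained model.
    return np.full_like(a, 0.1)


def noisy_flow_rollout(obs, num_steps, rng, velocity=toy_velocity, noise_std=toy_noise_std):
    """Sample an action through a noise-injected flow and return its chain log-probability."""
    dt = 1.0 / num_steps
    # Assumed initial distribution: standard Gaussian with the same shape as obs.
    a = rng.standard_normal(obs.shape)
    log_prob = gaussian_log_prob(a, np.zeros_like(a), np.ones_like(a))
    for k in range(num_steps):
        t = k * dt
        # Deterministic Euler step of the flow gives the transition mean.
        mean = a + dt * velocity(t, a, obs)
        # Injected noise turns the step into a Gaussian Markov transition.
        std = noise_std(t, a, obs)
        a_next = mean + std * rng.standard_normal(a.shape)
        # Each transition contributes an exact Gaussian log-density term.
        log_prob = log_prob + gaussian_log_prob(a_next, mean, std)
        a = a_next
    return a, log_prob


rng = np.random.default_rng(0)
obs = np.zeros((4, 3))  # toy batch of 4 observations with 3 dimensions
action, log_prob = noisy_flow_rollout(obs, num_steps=4, rng=rng)
```

Because the log-probability is available in closed form for any number of steps, including a single step, a policy gradient method such as PPO can form its probability ratio directly from `log_prob` under the old and new parameters.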
