$π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xiang Li, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu

TL;DR

This work tackles the challenge of online reinforcement learning fine-tuning for flow-based Vision-Language-Action models, whose denoising-based action generation makes log-likelihood computation intractable. The authors introduce π_RL, a framework with Flow-Noise and Flow-SDE to enable exact likelihood estimation and efficient exploration via two distinct MDP formulations and PPO-based policy optimization. Across LIBERO, ManiSkill, and MetaWorld, π_RL yields substantial improvements over SFT baselines and demonstrates strong generalization to multi-task settings, with ablations clarifying design choices. The results push forward practical RL-based fine-tuning for high-frequency, flow-based robotic control, and the open-source release supports reproducibility and broader adoption.

Abstract

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $π_0$, $π_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $π_{\texttt{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $π_{\texttt{RL}}$ implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $π_{\texttt{RL}}$ on the LIBERO, ManiSkill, and MetaWorld benchmarks. On LIBERO, $π_{\texttt{RL}}$ boosts the few-shot SFT models $π_0$ and $π_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. On ManiSkill, we train $π_{\texttt{RL}}$ in 320 parallel environments, improving $π_0$ from 38.4% to 78.8% and $π_{0.5}$ from 40.1% to 90.8% across 4352 variations of pick-and-place tasks. On MetaWorld, RL is conducted over 50 different manipulation tasks and yields performance gains of 35.0% and 26.9% for the $π_0$ and $π_{0.5}$ models, respectively. Overall, $π_{\texttt{RL}}$ achieves significant performance gains and stronger generalization than SFT models, validating the effectiveness of online RL for flow-based VLAs.
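To make the tractability point concrete, the following is a minimal sketch, not the authors' implementation, of why injecting Gaussian noise into each denoising step gives a computable likelihood. Once every step is a Gaussian transition, the log-likelihood of the whole chain (conditioned on the initial noise sample) is a sum of per-step Gaussian log-densities. The Euler discretization, the time direction (0 to 1), and the names `toy_velocity` and `toy_noise_std` are assumptions made for illustration; in particular `toy_noise_std` only stands in for a learnable noise network.

```python
import numpy as np

def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = std ** 2
    return float(np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))))

def sample_chain(velocity_fn, noise_std_fn, x0, num_steps, rng):
    """Run a noisy Euler denoising chain; return the states and their joint log-prob."""
    dt = 1.0 / num_steps
    xs, total, x = [x0], 0.0, x0
    for k in range(num_steps):
        t = k * dt
        mean = x + velocity_fn(x, t) * dt      # deterministic flow step
        std = noise_std_fn(x, t)               # injected noise scale (assumed form)
        x_next = mean + std * rng.standard_normal(x.shape)
        total += gaussian_log_prob(x_next, mean, std)
        xs.append(x_next)
        x = x_next
    return xs, total

def chain_log_prob(velocity_fn, noise_std_fn, xs):
    """Re-evaluate the joint log-prob of a stored chain, e.g. under updated weights."""
    num_steps = len(xs) - 1
    dt = 1.0 / num_steps
    total = 0.0
    for k in range(num_steps):
        t = k * dt
        mean = xs[k] + velocity_fn(xs[k], t) * dt
        total += gaussian_log_prob(xs[k + 1], mean, noise_std_fn(xs[k], t))
    return total

# Hypothetical stand-ins for the velocity head and the noise network.
def toy_velocity(x, t):
    return -x

def toy_noise_std(x, t):
    return 0.1 * np.ones_like(x)
```

A policy-gradient method such as PPO can then form its importance ratio as `exp(new_log_prob - old_log_prob)`, calling `chain_log_prob` on the stored chain with the old and the updated parameters. The density of the initial noise sample does not depend on the parameters, so it cancels in that ratio.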

Paper Structure

This paper contains 40 sections, 16 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Overview of $\pi_{\texttt{RL}}$, an online RL framework with two approaches, Flow-Noise and Flow-SDE, designed to improve the performance and generalization of SFT-aligned flow-based VLAs such as $\pi_0$ and $\pi_{0.5}$. Experiments on the LIBERO, ManiSkill, and MetaWorld benchmarks show that $\pi_{\texttt{RL}}$ achieves significant gains over SFT models.
  • Figure 2: Two optimization methods in $\pi_{\texttt{RL}}$. Flow-Noise adds learnable noise in a one-layer MDP (see Figure 3) and uses the joint likelihood of the denoising chain for the policy gradient. Flow-SDE builds a two-layer MDP with ODE-to-SDE conversion and computes the likelihood directly.
  • Figure 3: Illustration of noise injection in flow matching, exemplified by $\pi_{0.5}$, which integrates image, language, and state information into a unified VLM input.
  • Figure 4: Illustration of the two critic placement configurations.
  • Figure 5: Visual comparison of PPO and GRPO with Flow-SDE $\pi_0$ on LIBERO, showing that PPO outperforms GRPO in both converged performance and training speed.
  • ...and 11 more figures