$π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Kang Chen; Zhihao Liu; Tonghe Zhang; Zhen Guo; Si Xu; Hao Lin; Hongzhi Zang; Xiang Li; Quanlu Zhang; Zhaofei Yu; Guoliang Fan; Tiejun Huang; Yu Wang; Chao Yu


TL;DR

This work tackles online reinforcement learning fine-tuning for flow-based Vision-Language-Action models, whose denoising-based action generation makes log-likelihood computation intractable. The authors introduce $π_\texttt{RL}$, a framework with two algorithms, Flow-Noise and Flow-SDE, which enable exact log-likelihood computation and efficient exploration through two distinct MDP formulations combined with PPO-based policy optimization. Across LIBERO, ManiSkill, and MetaWorld, $π_\texttt{RL}$ yields substantial improvements over SFT baselines and generalizes well to multi-task settings, with ablations clarifying the design choices. The results advance practical RL-based fine-tuning for high-frequency, flow-based robotic control, and the open-source release supports reproducibility and broader adoption.

Abstract

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $π_0$, $π_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $π_{\texttt{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $π_{\texttt{RL}}$ implements two RL algorithms: (1) \textbf{Flow-Noise} models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) \textbf{Flow-SDE} integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $π_{\texttt{RL}}$ on the LIBERO, ManiSkill, and MetaWorld benchmarks. On LIBERO, $π_{\texttt{RL}}$ boosts the few-shot SFT models $π_0$ and $π_{0.5}$ from 57.6\% to 97.6\% and from 77.1\% to 98.3\%, respectively. On ManiSkill, we train $π_{\texttt{RL}}$ in 320 parallel environments, improving $π_0$ from 38.4\% to 78.8\% and $π_{0.5}$ from 40.1\% to 90.8\% across 4352 variations of pick-and-place tasks. On MetaWorld, RL is conducted over 50 different manipulation tasks and yields performance gains of 35.0\% and 26.9\% for the $π_0$ and $π_{0.5}$ models, respectively. Overall, $π_{\texttt{RL}}$ achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
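To illustrate why treating denoising as a discrete-time Markov chain makes the log-likelihood tractable, the following is a minimal sketch, not the authors' implementation. It assumes each denoising step is a Gaussian transition whose mean is an Euler step of a velocity field. The names `toy_velocity`, `step_mean`, `sample_chain`, and `chain_log_prob` are hypothetical, the velocity field is a toy stand-in for the learned flow model, and a fixed noise scale `std` replaces the learnable noise network mentioned in the abstract. Under these assumptions the log-probability of the recorded chain is simply a sum of per-step Gaussian log-densities (the initial noise density is left out here for brevity).

```python
import math
import random


def toy_velocity(x, t):
    """Hypothetical stand-in for a learned flow velocity field."""
    return [-xi for xi in x]


def gaussian_logpdf(x, mean, std):
    """Log density of an isotropic Gaussian N(mean, std^2 I) at x."""
    d = len(x)
    sq = sum(((xi - mi) / std) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq - d * math.log(std) - 0.5 * d * math.log(2 * math.pi)


def step_mean(x, t, dt):
    """Mean of the next state: one Euler step along the velocity field."""
    v = toy_velocity(x, t)
    return [xi + vi * dt for xi, vi in zip(x, v)]


def sample_chain(rng, dim=4, num_steps=8, std=0.1):
    """Roll out a noisy denoising chain and keep every intermediate state."""
    dt = 1.0 / num_steps
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    states = [x]
    for k in range(num_steps):
        mean = step_mean(x, k * dt, dt)
        # Assumption: fixed noise scale instead of a learned noise network.
        x = [m + std * rng.gauss(0.0, 1.0) for m in mean]
        states.append(x)
    return states


def chain_log_prob(states, std=0.1):
    """Sum the Gaussian transition log-densities along a recorded chain."""
    num_steps = len(states) - 1
    dt = 1.0 / num_steps
    total = 0.0
    for k in range(num_steps):
        mean = step_mean(states[k], k * dt, dt)
        total += gaussian_logpdf(states[k + 1], mean, std)
    return total


rng = random.Random(0)
states = sample_chain(rng)
log_prob = chain_log_prob(states)
```

Because every transition density is available in closed form, a quantity like `log_prob` can be differentiated and plugged into a likelihood-ratio objective, which is what a deterministic ODE sampler does not provide directly.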

