Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models

Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, Yi Zeng

TL;DR

Vision-Language-Action models pretrained via demonstrations are powerful but constrained by data quality; online reinforcement fine-tuning is hindered by intractable density ratios for flow-based policies. Flow Policy Optimization (FPO) introduces a likelihood-free policy ratio derived from per-sample changes in the conditional flow-matching objective, combined with structure-aware credit assignment, a clipped surrogate, multi-step latent exploration, and a Q-ensemble to enable stable online RL of the $\pi_0$ policy. Across LIBERO and ALOHA benchmarks, $\pi_0$-FPO surpasses six strong baselines, with ablations validating each component and latent-space analyses revealing a shift from broad exploration to focused, high-value control. The approach demonstrates practical, scalable online adaptation for high-frequency, long-horizon visuomotor control in sparse-reward, contact-rich tasks, enabling stronger generalization beyond imitation data.

Abstract

Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible for flow-matching-based models because importance sampling requires explicit policy ratios, and these are intractable to compute for such policies. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $\pi_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $\pi_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
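To make the likelihood-free ratio and the clipped surrogate concrete, here is a minimal NumPy sketch. It is not the paper's implementation, and the abstract does not give the exact formula. The sketch assumes the ratio proxy is `exp(loss_old - loss_new)` on per-sample conditional flow-matching losses, assumes a linear interpolation path between noise and action, and omits observation and language conditioning. The names `cfm_loss_per_sample`, `fpo_surrogate`, `velocity_fn`, and `clip_eps` are hypothetical.

```python
import numpy as np

def cfm_loss_per_sample(velocity_fn, actions, noise, t):
    """Per-sample conditional flow-matching loss.

    Assumed convention (not taken from the paper): a linear path
    x_t = (1 - t) * noise + t * action, with target velocity action - noise.
    """
    t = t[:, None]                          # broadcast time over action dims
    x_t = (1.0 - t) * noise + t * actions   # point on the noise-to-action path
    target = actions - noise                # velocity of the linear path
    pred = velocity_fn(x_t, t)              # model's predicted velocity
    return np.mean((pred - target) ** 2, axis=-1)

def fpo_surrogate(loss_new, loss_old, advantages, clip_eps=0.2):
    """Clipped surrogate using a loss-difference proxy for the policy ratio.

    Assumption: a drop in a sample's CFM loss is treated as a rise in its
    likelihood, so ratio = exp(loss_old - loss_new). No density is evaluated.
    """
    ratio = np.exp(loss_old - loss_new)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style pessimistic bound: take the smaller of the two objectives.
    objective = np.minimum(ratio * advantages, clipped * advantages)
    return objective.mean(), ratio
```

In this sketch, both losses would be evaluated on the same `(noise, t)` draws for the old and the current parameters, so the loss difference reflects the parameter change and not sampling noise. The surrogate is then maximized with respect to the current parameters only.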

Paper Structure

This paper contains 28 sections, 7 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: An overview of the challenging visuomotor control environments used in our evaluation: the bimanual ALOHA Transfer Cube task and several multi-object manipulation tasks from the LIBERO suite. These environments require a combination of long-horizon reasoning, precise control, and generalization across different objects and initial conditions.
  • Figure 2: LIBERO-Long simulation: online fine-tuning curves on a representative task.
  • Figure 3: FPO online learning on the ALOHA Transfer Cube task. (a) Policy evolution at 0/0.8M/1.6M training steps: the baseline side-grasp failure mode is corrected to a robust top-down grasp that consistently completes the task. (b) Success rate (SR) curve: the smoothed trajectory (purple) steadily improves, surpassing the 40% baseline (red dashed) and reaching 65%, mirroring the behavioural change in (a).
  • Figure 4: FPO latent action space evolution. Visualized via t-SNE, this figure shows the policy's latent action distribution transitioning from broad exploration to focused exploitation across training stages. (a) Initial policy: wide, high-variance exploration. (b) Breakthrough phase: distribution concentrates around successful sequences. (c) Late-training phase: highly focused, low-variance exploitation of optimal regions. (d) Bar chart: quantifies reduced exploration range and dispersion, confirming convergence to refined behaviors.
  • Figure 5: FPO's ability to correct suboptimal behaviors. (Top) The SFT baseline policy consistently fails the task due to a suboptimal grasping approach inherited from the imitation prior. (Bottom) After online fine-tuning with FPO, the policy discovers a novel and successful trajectory from the same initial state, showcasing effective online correction.
  • ...and 1 more figures