VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning

Si-Cheng Wang, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Ao-Qun Jin, Zeng-Guang Hou

TL;DR

This work addresses reinforcement learning (RL) post-training of vision–language–action (VLA) models, which is hindered by sparse rewards and unstable training. It introduces action-chunked proximal policy optimization (PPO) combined with a self-generated demonstration buffer and an online, adaptive weighting scheme that blends the PPO objective with a self behavior cloning loss. On MetaWorld MT10, the approach achieves a higher average success rate (0.93) and faster task completion (42.17 steps) than supervised fine-tuning on both small and larger demonstration sets, and it yields shorter, more efficient trajectories. The results demonstrate the viability of RL for VLA post-training and provide a scalable, continual-learning framework whose components separately target feedback density (action chunking), demonstration quality (the self-collected buffer), and training stability (the adaptive loss weighting), with potential impact on downstream VLA deployment.

Abstract

Reinforcement learning (RL) is a promising avenue for post-training vision-language-action (VLA) models, but practical deployment is hindered by sparse rewards and unstable training. This work mitigates these challenges by introducing action-chunked proximal policy optimization (PPO) combined with behavior cloning on self-collected demonstrations. Aggregating consecutive actions into chunks improves the temporal consistency of the policy and the density of informative feedback. In addition, an auxiliary behavior cloning loss is applied with a dynamically updated demonstration buffer that continually collects high-quality task trials during training. The relative weight between the action-chunked PPO objective and the auxiliary self behavior cloning loss is adapted online to stabilize the post-training process. Experiments on the MetaWorld benchmark indicate improved performance over supervised fine-tuning, achieving a high success rate (0.93) and few steps to success (42.17). These results demonstrate the viability of RL for VLA post-training and help lay the groundwork for downstream VLA applications.
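
The three ingredients named in the abstract (a chunk-level PPO objective, a self-collected demonstration buffer, and an online weight between the two losses) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the weighting rule, the buffer's admission criterion, or how chunk probabilities are computed, so everything below is an assumption made for illustration: the names `chunk_log_prob`, `clipped_surrogate`, `DemoBuffer`, `bc_weight`, and `hybrid_loss` are hypothetical, the buffer keeps the shortest successful trials, the chunk log-probability is the sum of per-step log-probabilities, and the weight decays linearly with the recent success rate.

```python
import heapq
import math
from dataclasses import dataclass, field


def chunk_log_prob(step_log_probs):
    """Joint log-probability of one action chunk.

    Assumption: the chunk factorizes over its steps, so the per-step
    log-probabilities are simply summed.
    """
    return sum(step_log_probs)


def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """Standard PPO clipped objective for a single chunk, returned as a loss."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return -min(ratio * advantage, clipped * advantage)


@dataclass
class DemoBuffer:
    """Keeps the `capacity` shortest successful trajectories seen so far.

    Assumption: "high-quality" means successful and short; the source
    does not state the actual admission rule.
    """
    capacity: int = 4
    _heap: list = field(default_factory=list)  # entries: (-steps, insertion id, trajectory)
    _counter: int = 0

    def add(self, trajectory, success, steps):
        if not success:
            return  # failed trials are never stored
        entry = (-steps, self._counter, trajectory)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif -steps > self._heap[0][0]:
            # The heap root is the longest stored trial; replace it if the new one is shorter.
            heapq.heapreplace(self._heap, entry)

    def trajectories(self):
        """Stored trajectories, shortest first."""
        return [traj for _, _, traj in sorted(self._heap, reverse=True)]


def bc_weight(success_rate, beta_max=1.0):
    """Hypothetical online weight: lean on behavior cloning while success is rare."""
    return beta_max * (1.0 - success_rate)


def hybrid_loss(ppo_loss, bc_loss, beta):
    """Weighted sum of the chunk-level PPO loss and the auxiliary cloning loss."""
    return ppo_loss + beta * bc_loss
```

In a training loop under these assumptions, each finished rollout would be offered to `DemoBuffer.add`, a behavior cloning loss would be computed on samples from `trajectories()`, and `bc_weight` would be re-evaluated from a running success rate before the losses are combined with `hybrid_loss`.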

Paper Structure

This paper contains 14 sections, 8 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Pipeline of the proposed method. (a) The actor–critic architecture. (b) Post-training under hybrid action-chunked PPO and self-behavior cloning.
  • Figure 2: Smoothed performance curve of the ablation study on the MetaWorld Push task. (a) Effect of action chunk in PPO. (b) Effect of the demonstration buffer.
  • Figure 3: Case study comparing supervised fine-tuning (10 demonstrations) and the proposed method. (a) Window open. (b) Drawer close.