TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization
Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, Zipei Fan
TL;DR
VLA models trained via supervised fine-tuning struggle with out-of-distribution and long-horizon tasks. The paper introduces TGRPO, an online RL framework that combines LLM-generated dense, multi-stage rewards with group-relative policy optimization at both the trajectory and step levels, updating VLA policies without a value network. The approach achieves an average 80.7% success rate on the LIBERO benchmark, outperforming SFT and RL baselines, with ablations confirming the necessity of both advantage levels and a balanced group size. This work offers a scalable, failure-aware pathway for adaptive VLA fine-tuning in dynamic robotic environments.
Abstract
Vision-Language-Action (VLA) models have demonstrated strong cross-scenario generalization in a variety of robotic tasks through large-scale pre-training and task-specific fine-tuning. However, their training paradigm relies mainly on manually collected successful demonstrations, which makes it difficult for them to adapt to complex environments when they encounter out-of-distribution (OOD) scenarios or execution biases. While Reinforcement Learning (RL) provides a closed-loop optimization framework via an active trial-and-error mechanism, it suffers from sparse rewards, high variance, and unstable optimization in long-horizon robotic tasks. To address these limitations, we propose Trajectory-wise Group Relative Policy Optimization (TGRPO), an online RL-based training framework for VLA models. TGRPO leverages task analysis generated by a large language model to automatically construct dense reward functions, providing fine-grained feedback that accelerates convergence and improves credit assignment. The core of our method is a group-based strategy that samples multiple trajectories in parallel and normalizes them against one another, reducing variance through relative comparison. By integrating trajectory-level and step-level advantage estimation, TGRPO captures both global and local optimization signals without relying on a value network. Experiments on four task categories of the LIBERO benchmark demonstrate that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT) and outperforms other representative RL-based post-training methods.
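
The group-relative idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `group_relative_advantages`, the weights `alpha` and `beta`, the additive way the two advantage levels are combined, and the assumption that all trajectories in a group have the same length are hypothetical choices made only for illustration. The sketch normalizes per-step rewards and total returns across a group of trajectories sampled for the same task, so that no value network is needed as a baseline.

```python
import numpy as np


def group_relative_advantages(step_rewards, alpha=1.0, beta=1.0, eps=1e-8):
    """Illustrative group-relative advantages (hypothetical, not the paper's code).

    step_rewards: array-like of shape (G, T), holding dense per-step rewards for
    G trajectories sampled for the same task, each with T steps (assumed equal).
    alpha, beta: assumed weights for the step-level and trajectory-level terms.
    """
    r = np.asarray(step_rewards, dtype=np.float64)

    # Step-level signal: compare each step's reward with the same step
    # in the other trajectories of the group.
    step_adv = (r - r.mean(axis=0, keepdims=True)) / (r.std(axis=0, keepdims=True) + eps)

    # Trajectory-level signal: compare each trajectory's total return
    # with the other returns in the group.
    returns = r.sum(axis=1)
    traj_adv = (returns - returns.mean()) / (returns.std() + eps)

    # Assumed combination: weighted sum, with the trajectory-level term
    # broadcast to every step of its trajectory.
    return alpha * step_adv + beta * traj_adv[:, None]


# Three trajectories of three steps each; the last one collects the most reward.
advantages = group_relative_advantages([[0, 0, 1], [0, 1, 1], [1, 1, 2]])
```

Because both terms are centered on the group mean, the advantages at each step average to zero across the group, which is the relative comparison that stands in for a learned baseline.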
