TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization

Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, Zipei Fan

TL;DR

Vision-Language-Action (VLA) models trained via supervised fine-tuning (SFT) struggle with out-of-distribution and long-horizon tasks. The paper introduces TGRPO, an online RL framework that uses LLM-generated dense, multi-stage rewards and combines trajectory-level and step-level group-relative advantages to update VLA policies without a value network. The approach achieves an average success rate of 80.7% on the LIBERO benchmark, outperforming SFT and RL baselines, with ablations confirming that both advantage levels and a balanced group size are necessary. This work offers a scalable, failure-aware pathway for adaptive VLA fine-tuning in dynamic robotic environments.

Abstract

Vision-Language-Action (VLA) models have demonstrated strong cross-scenario generalization in various robotic tasks through large-scale pre-training and task-specific fine-tuning. However, their training paradigm relies mainly on manually collected successful demonstrations, which makes it difficult for them to adapt to complex environments when they encounter out-of-distribution (OOD) scenarios or execution biases. While Reinforcement Learning (RL) provides a closed-loop optimization framework through active trial and error, it suffers from sparse rewards, high variance, and unstable optimization in long-horizon robotic tasks. To address these limitations, we propose Trajectory-based Group Relative Policy Optimization (TGRPO), an online RL-based training framework for VLA models. TGRPO leverages task analysis generated by a large language model to automatically construct dense reward functions, providing fine-grained feedback that accelerates convergence and improves credit assignment. The core of our method is a group-based strategy that samples multiple trajectories in parallel and normalizes them within the group, reducing variance through relative comparison. By integrating trajectory-level and step-level advantage estimation, TGRPO captures both global and local optimization signals without relying on a value network. Experiments on four task categories of the LIBERO benchmark demonstrate that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT), and outperforms other representative RL-based post-training methods.
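The group-based advantage estimation described in the abstract can be illustrated with a minimal sketch. The exact formulas are not given in this summary, so the code below assumes GRPO-style standardization (subtract the group mean, divide by the group standard deviation) at both levels and a simple weighted sum for the fusion. The function names, the equal-length trajectories, and the default weights are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Standardize values against their own group: (v - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def fused_advantages(step_rewards, alpha_step=0.3, alpha_traj=0.7):
    """Fuse step-level and trajectory-level group-relative advantages.

    step_rewards[i][t] is the reward of trajectory i at step t.
    Assumption: all trajectories in the group have the same length.
    """
    n_traj = len(step_rewards)
    horizon = len(step_rewards[0])

    # Trajectory level: compare total returns across the group.
    returns = [sum(traj) for traj in step_rewards]
    traj_adv = normalize(returns)

    # Step level: compare rewards at the same step index across the group.
    step_adv = [
        normalize([step_rewards[i][t] for i in range(n_traj)])
        for t in range(horizon)
    ]

    # Weighted sum of the two signals; no value network is involved.
    return [
        [alpha_step * step_adv[t][i] + alpha_traj * traj_adv[i]
         for t in range(horizon)]
        for i in range(n_traj)
    ]


if __name__ == "__main__":
    group = [[0.0, 0.0, 1.0], [0.0, 1.0, 2.0], [0.0, 0.0, 0.0]]
    for row in fused_advantages(group):
        print([round(a, 3) for a in row])
```

Because every advantage is measured relative to the other trajectories in the same group, a trajectory with an above-average return receives positive advantages even at steps where all group members earned the same reward.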

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Problem: VLA models trained by SFT generalize poorly and fail in unseen scenarios. Method: TGRPO uses LLM-generated rewards and step–trajectory group fusion for stable policy learning. Result: On the LIBERO benchmark, TGRPO outperforms SFT and GRAPE.
  • Figure 2: Overview of the proposed Trajectory-wise Group Relative Policy Optimization. Given natural language instructions and multimodal observations, the OpenVLA model produces action tokens to control the robot. Trajectories sampled in parallel across environments are evaluated with a multi-stage reward function, grouped for step-level and trajectory-level advantage estimation, and fused to update the policy.
  • Figure 3: Success rates of different methods on the LIBERO benchmark suites. Our proposed TGRPO consistently outperforms baseline methods (Octo, SFT, DPO, GRAPE) across all task categories (Spatial, Object, Goal, Long), achieving the highest average success rate of 80.7%.
  • Figure 4: An example from LIBERO-Long showing a successful trajectory produced by a policy trained with TGRPO. At each sampling step, rewards are assigned based on the states of key objects and the robot. The illustrated task is "put both the alphabet soup and the tomato sauce in the basket", where the key objects are alphabet soup, tomato sauce, and basket.
  • Figure 5: Hyperparameter analysis of TGRPO on the LIBERO-Goal suite. We evaluate the average success rate across different settings of the weighting coefficients $(\alpha_1,\alpha_2)$, which balance step-level and trajectory-level advantages. The best performance is achieved at $(0.3,0.7)$, yielding an average success rate of 81.0%.
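The multi-stage reward idea from Figure 4 can be sketched for the task "put both the alphabet soup and the tomato sauce in the basket". In the paper the reward function is constructed from LLM-generated task analysis; the stage definitions, state keys, and reward values below are hypothetical and chosen only to show how a dense reward could grow as key objects progress through stages.

```python
# Hypothetical key objects for the Figure 4 task (the basket is the target).
KEY_OBJECTS = ("alphabet_soup", "tomato_sauce")


def object_progress(obj_state):
    """Score one object's stage: 0 untouched, 1 grasped, 2 placed in basket.

    Assumption: obj_state is a dict with boolean keys "grasped" and
    "in_basket"; these keys are illustrative, not a LIBERO API.
    """
    if obj_state["in_basket"]:
        return 2.0
    if obj_state["grasped"]:
        return 1.0
    return 0.0


def multi_stage_reward(state):
    """Dense reward at one sampling step: summed progress over key objects."""
    return sum(object_progress(state[name]) for name in KEY_OBJECTS)


if __name__ == "__main__":
    start = {
        "alphabet_soup": {"grasped": False, "in_basket": False},
        "tomato_sauce": {"grasped": False, "in_basket": False},
    }
    midway = {
        "alphabet_soup": {"grasped": False, "in_basket": True},
        "tomato_sauce": {"grasped": True, "in_basket": False},
    }
    print(multi_stage_reward(start))   # 0.0
    print(multi_stage_reward(midway))  # 3.0
```

Unlike a sparse success signal, this kind of reward distinguishes a rollout that placed one object from a rollout that placed none, which gives the group-relative comparison something to rank even when no trajectory in the group completes the task.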