STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models

Feng Xu, Guangyao Zhai, Xin Kong, Tingzhong Fu, Daniel F. N. Gordon, Xueli An, Benjamin Busam

TL;DR

The paper introduces Stage-Aware Reinforcement (StARe), which decomposes long-horizon Vision-Language-Action tasks into semantically meaningful stages and so enables dense, stage-aligned reinforcement signals. It then develops offline Stage-Aware Trajectory Preference Optimization (StA-TPO) and online Stage-Aware PPO (StA-PPO) to provide fine-grained credit assignment and progressive learning. Combined with supervised fine-tuning in the Imitation -> Preference -> Interaction (IPI) pipeline, the approach achieves state-of-the-art success rates of 98.0 percent on SimplerEnv and 96.4 percent on ManiSkill3, with gains in both in-distribution and out-of-distribution performance. The results indicate that stage-wise objectives and potential-based intra-stage rewards improve stability and sample efficiency when fine-tuning VLA models for long-horizon robotic manipulation.

Abstract

Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have shown remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. We therefore present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO, we obtain Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference and online intra-stage interaction, respectively. Building further on supervised fine-tuning as initialization, we propose Imitation -> Preference -> Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, achieving state-of-the-art success rates of 98.0 percent on SimplerEnv and 96.4 percent on ManiSkill3 tasks.
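To make the idea of dense, stage-aligned signals concrete, the following is a minimal sketch of potential-based reward shaping applied within stages. It is not the paper's implementation: the toy state layout, the `Stage` class, the stage predicates, the potentials, and the `bonus` value are all assumptions chosen for illustration. Within a stage the reward is the standard shaping term gamma * Phi(s') - Phi(s); a change of stage yields a fixed bonus (or penalty on regression).

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical toy state: (distance to object, grasped flag, distance to target).
State = Tuple[float, float, float]


@dataclass(frozen=True)
class Stage:
    name: str
    potential: Callable[[State], float]  # higher means closer to finishing the stage
    done: Callable[[State], bool]        # True once the stage is complete


# Assumed stage definitions for a toy pick-and-place task (not taken from the paper).
STAGES = (
    Stage("reach", lambda s: -s[0], lambda s: s[0] < 0.05),
    Stage("grasp", lambda s: s[1], lambda s: s[1] >= 1.0),
    Stage("transport", lambda s: -s[2], lambda s: s[2] < 0.05),
)


def current_stage(state: State, stages: Sequence[Stage] = STAGES) -> int:
    """Return the index of the first unfinished stage, or len(stages) if all are done."""
    for i, stage in enumerate(stages):
        if not stage.done(state):
            return i
    return len(stages)


def stage_reward(
    s: State,
    s_next: State,
    stages: Sequence[Stage] = STAGES,
    gamma: float = 0.99,
    bonus: float = 1.0,
) -> float:
    """Dense reward: stage-change bonus, else potential-based shaping inside the stage."""
    k, k_next = current_stage(s, stages), current_stage(s_next, stages)
    if k_next != k:
        # Positive when advancing to a later stage, negative when falling back.
        return bonus * (k_next - k)
    if k == len(stages):
        return 0.0  # task already complete; no further signal
    phi = stages[k].potential
    return gamma * phi(s_next) - phi(s)


if __name__ == "__main__":
    # Moving toward the object during the reach stage gives a positive reward.
    print(stage_reward((0.5, 0.0, 1.0), (0.3, 0.0, 1.0)))
```

Tying each potential to a single stage keeps the signal interpretable: a negative reward during `transport` means the policy moved away from the target, independent of how well it reached or grasped earlier.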

Paper Structure

This paper contains 71 sections, 39 equations, 6 figures, 3 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Language Reasoning vs. Action Reasoning. Given an RGB image as the observation (a), the language model (b) is asked to describe the content in the image, and produces Sentence 1 and Sentence 2. These sentences are flexibly ordered and jointly contribute to the global meaning required to answer the question. In contrast, the VLA model (c), when instructed to place the cup onto the plate, generates an action trajectory composed of semantically meaningful stages (Reach$\to$Grasp$\to$Transport$\to$Place), which follow a strict order and vary in difficulty (with the more challenging stages shown in bold).
  • Figure 2: Overview of the STARE Framework and Its Integration into the IPI Training Pipeline.
  • Figure 3: Two simulated benchmarks. We show the experimental setups and example tasks.
  • Figure 4: Comparison of learning curves across eight representative tasks from SimplerEnv and ManiSkill3. The y-axis denotes the success rate, and the x-axis shows the number of environment interaction steps (in thousands).
  • Figure 5: Offline Stage-wise ablation on two tasks. We report stage completion rates (%) for StackGreenonYellow (SimplerEnv) and LiftPegUpright (ManiSkill3). Compared with TPO, StA-TPO achieves significant gains, particularly in the grasp and place/upright stages, which are critical for final success.
  • ...and 1 more figures