Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi "Jim" Fan, Guanya Shi, Yuke Zhu

TL;DR

PLD addresses the data and distribution challenges of post-training vision-language-action (VLA) models for robotics by adding a residual-RL data-generation stage. With the base VLA frozen, the three-stage pipeline learns lightweight residual specialists that probe failure regions, collects data through deployment-aligned hybrid rollouts, and distills the successes back into the generalist via SFT. Empirically, PLD reaches near-saturation on LIBERO (~99%), delivers substantial gains on SimplerEnv, and performs robustly on real-world Franka and YAM arm manipulation tasks, with ablations confirming the importance of residual probing and distribution-aware replay. The result is a scalable, autonomous pathway toward self-improving multi-embodiment VLA systems that need fewer human demonstrations.

Abstract

Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.
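The hybrid rollout in Stage 2 can be illustrated with a toy sketch. The code below is not the authors' implementation: the 1-D environment, `base_policy`, `residual_policy`, and every constant are hypothetical stand-ins chosen so that the frozen base policy alone fails while the base-plus-residual action succeeds. It only shows the control flow described above: the base policy acts alone for the first few steps, the residual specialist then adds its correction, and only successful trajectories are kept for SFT.

```python
import random
from typing import List, Optional, Tuple

Step = Tuple[float, float]  # (state, action) pair in a toy 1-D task

GOAL = 1.0  # hypothetical goal position


def base_policy(state: float) -> float:
    """Stand-in for the frozen generalist: moves toward the goal, but too weakly."""
    return 0.05 * (GOAL - state)


def residual_policy(state: float) -> float:
    """Stand-in for a learned residual actor: a correction added to the base action."""
    return 0.45 * (GOAL - state)


def hybrid_rollout(probe_steps: int, horizon: int = 20, tol: float = 0.05,
                   rng: Optional[random.Random] = None) -> Tuple[List[Step], bool]:
    """Base policy acts alone for `probe_steps`, then base + residual takes over."""
    rng = rng or random.Random(0)
    state = rng.uniform(-1.0, 0.0)
    trajectory: List[Step] = []
    for t in range(horizon):
        action = base_policy(state)
        if t >= probe_steps:
            action += residual_policy(state)  # residual composition (assumed additive)
        trajectory.append((state, action))
        state += action  # toy dynamics, purely illustrative
        if abs(GOAL - state) < tol:
            return trajectory, True
    return trajectory, False


def collect_successes(episodes: int, max_probe: int = 8, seed: int = 0) -> List[List[Step]]:
    """Sample a probing horizon per episode and keep only successful trajectories."""
    rng = random.Random(seed)
    kept = []
    for _ in range(episodes):
        traj, ok = hybrid_rollout(rng.randint(0, max_probe), rng=rng)
        if ok:
            kept.append(traj)
    return kept
```

In this toy setting, states visited during the base-only prefix come from the base policy's own distribution, and the steps after the takeover record how the combined policy recovers from them.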

Paper Structure

This paper contains 48 sections, 8 equations, 16 figures, 5 tables, 1 algorithm.

Figures (16)

  • Figure 2: Synergetic effect of PLD data. We fine-tune $\pi_0$ on subsets of LIBERO-90 with varying task coverage ratios, where each ratio (10–80%) indicates the fraction of distinct task instances included in training relative to the full 90-task distribution. For each ratio, we randomly sample 4 disjoint subsets of tasks and report the averaged results. The x-axis thus represents the degree of task coverage (not the number of trajectories), while evaluation is always conducted on all 90 tasks. We compare different data formulations: PLD data yields the highest in-distribution performance while retaining the cross-task generalization property of high-quality human data. It further enables modest zero-shot transfer even when trained on only 10% of tasks (24.4% SR on unseen tasks), whereas the VLA fine-tuned on base-policy rollout data (0-1 REINFORCE) underperforms and fails to generalize. (Success rates are reported in Table \ref{tab:libero-90-seen-to-unseen}.)
  • Figure 3: An overview of PLD. Our pipeline consists of three stages followed by deployment: 1) learning a specialist residual policy for each task via online off-policy RL, with efficient exploration guided by a frozen VLA generalist; 2) automatically generating hybrid trajectories by having the VLA roll out for the first $t$ steps and then letting the specialist take over to generate recovery data; 3) supervised fine-tuning on the collected multi-task PLD data; 4) deploying the fine-tuned generalist zero-shot on diverse manipulation tasks.
  • Figure 4: Visualization of data diversity. We visualize PLD data collected with different base-policy probing horizons. Increasing the probing horizon yields longer episodes and greater diversity among successful trials. This broader data support leads to improved fine-tuning performance, which eventually saturates.
  • Figure 5: Benchmarking Sample-Efficient RL Performance. We compare PLD with RL baseline algorithms that leverage either a policy prior or a data prior. We report mean rollout performance (average return computed within a sliding window of 100 episodes) and 95% CIs over 3 seeds across 8 manipulation tasks selected from LIBERO-90.
  • Figure 6: Short-to-long generalization. $\pi_0$ fine-tuned on LIBERO-90 and one-shot evaluated on LIBERO-10 long-horizon tasks.
  • ...and 11 more figures
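
The Figure 5 caption reports average return within a sliding window of 100 episodes, with 95% CIs over 3 seeds. A minimal sketch of how such a curve could be computed follows. The function names are hypothetical, and the interval uses a Student-t critical value (4.303 for 3 seeds, i.e. 2 degrees of freedom), which is an assumption: the caption does not state how the CIs were obtained.

```python
from collections import deque
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def sliding_mean(returns: Sequence[float], window: int = 100) -> List[float]:
    """Running average of episode returns over the most recent `window` episodes."""
    buf: deque = deque(maxlen=window)
    total = 0.0
    out: List[float] = []
    for r in returns:
        if len(buf) == window:
            total -= buf[0]  # the oldest return is about to be evicted
        buf.append(r)
        total += r
        out.append(total / len(buf))
    return out


def mean_and_ci(per_seed: Sequence[float], t_crit: float = 4.303) -> Tuple[float, float]:
    """Mean across seeds and the half-width of a t-based interval (assumed method)."""
    half_width = t_crit * stdev(per_seed) / len(per_seed) ** 0.5
    return mean(per_seed), half_width
```

Applying `sliding_mean` to each seed's return sequence and then `mean_and_ci` at each episode index gives a mean curve with an error band. With only 3 seeds the t-based band is wide, which is worth keeping in mind when comparing methods.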