$\pi^{*}_{0.6}$: a VLA That Learns From Experience

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou

TL;DR

We address the challenge of making vision-language-action (VLA) policies robust and efficient in real-world robotic deployment by proposing RECAP, a reinforcement-learning framework that learns from diverse data sources, including demonstrations, autonomous experience, and expert interventions. RECAP pre-trains a generalist VLA, $\pi^{*}_{0.6}$, via offline RL and then iteratively improves it with on-robot data using an advantage-conditioned policy extraction mechanism that leverages a distributional value function. The approach combines a distributional value predictor, Bayes-inspired policy reweighting, and a flow-matching action head to handle high-capacity VLA models. It improves throughput and success on laundry-folding, espresso-making, and box-assembly tasks: on some of the hardest tasks, throughput more than doubles and the failure rate is roughly halved. The work demonstrates that a carefully designed RL recipe can significantly boost the robustness and efficiency of generalist VLA policies in real-world settings, supporting more autonomous and scalable robotic learning.

Abstract

We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $\pi^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $\pi^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.

Paper Structure

This paper contains 27 sections, 15 equations, 12 figures, 1 algorithm.

Figures (12)

  • Figure 2: Some of the tasks learned by Recap. $\pi^{*}_{0.6}$ trained with Recap can make espresso drinks, assemble cardboard boxes, and fold diverse and realistic laundry with a high success rate. Each task involves realistic variability: flattened unfolded boxes stick together and bend, making espresso drinks requires pouring liquids, and folding laundry requires generalization to a wide range of clothing items.
  • Figure 3: Interaction between the $\pi^{*}_{0.6}$ VLA and value function during Recap training. The $\pi^{*}_{0.6}$ VLA uses a pre-trained VLM backbone. Training follows the KI recipe (Driess et al., 2025), with next-token prediction on many data sources in pre-training, and a flow-matching action expert with stop gradient. The VLA is conditioned on a binarized advantage indicator, obtained from a separate value function initialized from a smaller pre-trained VLM.
  • Figure 4: Visualization of the value functions. We train a multi-task value function to predict the number of steps to success, normalized by maximum task length to $(-1, 0)$, where $0$ corresponds to successful completion. We visualize the value function output on a folding task that finished successfully (left), and an unsuccessful example of a manipulation task from the pre-training dataset (right). The red parts highlight a drop in value, and green parts highlight increases; images on top show the corresponding frames of the episode. The visualization shows that the VF correctly identifies mistakes in the episode, as well as the speed of progress.
  • Figure 5: The robot setup used in our experiments. $\pi^{*}_{0.6}$ is trained on data from many different robots in pre-training. For the iterative improvement experiments, we use a static bimanual system with two 6-DoF arms with parallel-jaw grippers. The arms are controlled at 50 Hz with joint positions. Observations consist of joint and gripper positions, as well as images from three cameras: a base camera mounted between the arms, and a wrist-mounted camera on each arm. The setup can be mounted flexibly, e.g. on a table.
  • Figure 6: Illustrations of the tasks used in our experiments. Tasks include three different laundry variants, assembling boxes, and making coffee drinks with an espresso machine.
  • ...and 7 more figures
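
The captions for Figures 3 and 4 describe two quantities that are easy to mix up: a value target equal to the (negated) number of steps to success, normalized by maximum task length, and a binarized advantage indicator used to condition the VLA. The Python sketch below illustrates one way these could fit together. It is not the paper's implementation: the function names, the one-step advantage estimate, the per-step reward of `-1 / max_task_len`, and the `threshold` parameter are all assumptions made for illustration, and only successful episodes are handled.

```python
import numpy as np


def normalized_value_targets(episode_len: int, max_task_len: int) -> np.ndarray:
    """Value targets for a SUCCESSFUL episode (hypothetical helper).

    The target at step t is -(steps remaining until success) / max_task_len,
    so targets rise toward 0 as the episode approaches completion.
    """
    steps_remaining = np.arange(episode_len, 0, -1, dtype=np.float64)
    # Clip so episodes longer than max_task_len stay inside [-1, 0].
    return np.clip(-steps_remaining / max_task_len, -1.0, 0.0)


def binarized_advantage(values, max_task_len, terminal_value=0.0, threshold=0.0):
    """Binary 'good action' indicator from per-step value predictions.

    ASSUMPTION: a one-step advantage A_t = r + V_{t+1} - V_t with a constant
    per-step reward r = -1 / max_task_len; the threshold is a free parameter.
    """
    values = np.asarray(values, dtype=np.float64)
    # Value after the final step; 0 corresponds to successful completion.
    next_values = np.append(values[1:], terminal_value)
    step_reward = -1.0 / max_task_len
    advantages = step_reward + next_values - values
    return advantages > threshold


if __name__ == "__main__":
    print(normalized_value_targets(episode_len=5, max_task_len=10))
    # Predictions with a drop in value at the second step (a "mistake").
    predicted = [-0.5, -0.4, -0.6, -0.3, -0.1]
    print(binarized_advantage(predicted, max_task_len=10, threshold=0.05))
```

Under these assumptions, steps that make exactly the expected progress have an advantage of about zero, a drop in predicted value yields a negative advantage, and faster-than-expected progress yields a positive one, which matches the red and green regions described for Figure 4.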