Autonomous Improvement of Instruction Following Skills via Foundation Models

Zhiyuan Zhou, Pranav Atreya, Abraham Lee, Homer Walke, Oier Mees, Sergey Levine

TL;DR

The paper tackles the high cost and limited scalability of human demonstrations for robotic instruction-following. It introduces SOAR, a modular framework that uses vision-language models to propose semantically meaningful tasks, diffusion-based image editing to generate subgoal images, and a goal-conditioned policy trained with hindsight relabeling to learn from autonomous data. By decoupling language understanding from control and leveraging a frozen VLM for success detection, SOAR achieves significant performance gains on unseen scenes and releases a large autonomous dataset (SOAR-Data). The results demonstrate the practicality of autonomous improvement for language-conditioned robotics and highlight the benefits of semantic grounding and self-supervised learning for robust real-world deployment.
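
The hindsight relabeling mentioned above can be illustrated with a minimal sketch. It shows the generic technique only: each step of an autonomously collected trajectory is paired with a later observation from the same trajectory, which is treated as the goal that was reached, so no human label is needed. The function name, the `max_horizon` parameter, and the uniform sampling of the future index are assumptions made for illustration and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def hindsight_relabel(
    observations: Sequence[object],
    actions: Sequence[object],
    rng: random.Random,
    max_horizon: int = 20,  # hypothetical cap on how far ahead a goal is sampled
) -> List[Tuple[object, object, object]]:
    """Return (observation, action, goal) tuples where each goal is a
    future observation from the same trajectory."""
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    if max_horizon < 1:
        raise ValueError("max_horizon must be at least 1")

    examples = []
    for t, action in enumerate(actions):
        # The goal index is strictly in the future and within the trajectory.
        last = min(len(observations) - 1, t + max_horizon)
        goal_index = rng.randint(t + 1, last)  # inclusive on both ends
        examples.append((observations[t], action, observations[goal_index]))
    return examples


# Toy usage: observations are integers standing in for images.
toy_observations = list(range(6))
toy_actions = ["a", "b", "c", "d", "e"]
toy_examples = hindsight_relabel(toy_observations, toy_actions, random.Random(0))
```

Because every goal was actually reached from the paired observation, the resulting tuples can supervise a goal-conditioned policy even when the trajectory failed at the task that was originally proposed.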

Abstract

Intelligent instruction-following robots capable of improving from autonomously collected experience have the potential to transform robot learning: instead of collecting costly teleoperated demonstration data, large-scale deployment of fleets of robots can quickly collect larger quantities of autonomous data that can collectively improve their performance. However, autonomous improvement requires solving two key problems: (i) fully automating a scalable data collection procedure that can collect diverse and semantically meaningful robot data and (ii) learning from non-optimal, autonomous data with no human annotations. To this end, we propose a novel approach that addresses these challenges, allowing instruction-following policies to improve from autonomously collected data without human supervision. Our framework leverages vision-language models to collect and evaluate semantically meaningful experiences in new environments, and then utilizes a decomposition of instruction following tasks into (semantic) language-conditioned image generation and (non-semantic) goal reaching, which makes it significantly more practical to improve from this autonomously collected data without any human annotations. We carry out extensive experiments in the real world to demonstrate the effectiveness of our approach, and find that in a suite of unseen environments, the robot policy can be improved 2x with autonomously collected data. We open-source the code for our semantic autonomous improvement pipeline, as well as our autonomous dataset of 30.5K trajectories collected across five tabletop environments.

Paper Structure (33 sections, 1 equation, 6 figures, 6 tables)

Figures (6)

  • Figure 1: We introduce SOAR, an approach to autonomously improve instruction following policies by leveraging foundation models for large-scale autonomous data collection and self-improvement with no human in the loop. SOAR imports Internet-scale knowledge from pre-trained Vision-Language Models and image-editing Diffusion Models to guide autonomous data collection, and enables policy-improvement with a self-supervised objective on the autonomous data. We make available the $30.5$K autonomous trajectories collected with SOAR in SOAR-dataset.
  • Figure 2: Overview of the SOAR autonomous improvement pipeline: First, we equip the robot with a set of basic skills by pre-training. Then, we deploy the pre-trained policy on a fleet of five robots to autonomously collect data, with a VLM proposing viable language tasks to practice. Specifically, the language task is turned into a subgoal image via a pre-trained image-editing diffusion model, and the robot executes a goal-conditioned policy. Finally, we use a VLM to label success information of the collected trajectories, and train the policy using this data, resulting in improvement.
  • Figure 3: Trajectory rollout with decomposed language-conditioned policy: The VLM proposes a meaningful language task given the observation of the environment, and this language instruction is used to generate a subgoal image via an image-editing diffusion model. The goal-conditioned policy then tries to achieve a sequence of 5 subgoals (only 3 visualized here) each with $20$ steps. Finally, the VLM determines if the language instruction has been achieved at the end of a trajectory.
  • Figure 4: Across the five unique robot workspaces for the five WidowX robots, there are 10 different scenes, each scene corresponding to a distinct set of manipulatable objects available, and each scene supporting many different tasks that can be executed. 8 scenes support pick-and-place tasks, 1 supports drawer opening and closing, and 1 supports deformable cloth manipulation.
  • Figure 5: Autonomous improvement results in each scene: for all $10$ scenes, training on scene-specific autonomous data helps to significantly improve performance over the pre-trained policy. On average, the pre-trained policy has a success rate of $28\%$ and the improved policy has a success rate of $57\%$.
  • ...and 1 more figures
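
The rollout described in the Figure 3 caption can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the released SOAR code: the callables `propose_task`, `generate_subgoal`, `policy`, and `judge_success` are hypothetical stand-ins for the VLM task proposer, the image-editing diffusion model, the goal-conditioned policy, and the VLM success detector. Only the constants (5 subgoals, 20 steps each) come from the caption; regenerating each subgoal from the current observation is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List

NUM_SUBGOALS = 5        # from the Figure 3 caption
STEPS_PER_SUBGOAL = 20  # from the Figure 3 caption


@dataclass
class Trajectory:
    instruction: str
    observations: List[object] = field(default_factory=list)
    actions: List[object] = field(default_factory=list)
    success: bool = False


def collect_trajectory(
    reset: Callable[[], object],
    step: Callable[[object], object],
    propose_task: Callable[[object], str],              # hypothetical VLM wrapper
    generate_subgoal: Callable[[object, str], object],  # hypothetical diffusion wrapper
    policy: Callable[[object, object], object],         # goal-conditioned policy
    judge_success: Callable[[object, str], bool],       # hypothetical VLM wrapper
) -> Trajectory:
    obs = reset()
    instruction = propose_task(obs)
    traj = Trajectory(instruction=instruction, observations=[obs])
    for _ in range(NUM_SUBGOALS):
        # Assumption: each subgoal is generated from the current observation.
        subgoal = generate_subgoal(obs, instruction)
        for _ in range(STEPS_PER_SUBGOAL):
            action = policy(obs, subgoal)
            obs = step(action)
            traj.actions.append(action)
            traj.observations.append(obs)
    # Success is judged once, on the final observation.
    traj.success = judge_success(obs, instruction)
    return traj


class ToyEnv:
    """A 1-D stand-in for a robot scene: the state is an integer position."""

    def __init__(self) -> None:
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> int:
        self.state += action
        return self.state


env = ToyEnv()
toy_traj = collect_trajectory(
    reset=env.reset,
    step=env.step,
    propose_task=lambda obs: "move to position 100",
    generate_subgoal=lambda obs, instruction: obs + 20,
    policy=lambda obs, goal: 1 if obs < goal else 0,
    judge_success=lambda obs, instruction: obs >= 100,
)
```

Keeping the language-dependent components (task proposal, subgoal generation, success detection) behind separate callables mirrors the decomposition in the abstract: only the goal-conditioned policy needs to be retrained on the autonomous data.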