Transferring Foundation Models for Generalizable Robotic Manipulation

Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, Limin Wang

TL;DR

This work addresses the generalization gap in open-domain robotic manipulation by conditioning an end-to-end policy on language-reasoning segmentation masks derived from internet-scale foundation models. It introduces a two-stream TPM policy that fuses RGB images, language-informed masks, and proprioception to predict actions in a closed loop, trained via imitation learning. Real-world experiments on a Franka Emika arm demonstrate improved generalization to unseen objects, backgrounds, and distractors, and the approach extends to additional skills with limited demonstrations. The combination of foundation-model-driven perception and a lightweight, multi-view policy offers a scalable path toward sample-efficient, versatile robotic manipulation.

Abstract

Improving the generalization capabilities of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data, such as the RT-1 dataset, which is costly and time-consuming. Moreover, because such data lack sufficient diversity, these approaches typically have limited capability in open-domain scenarios with new objects and diverse environments. In this paper, we propose a novel paradigm that leverages language-reasoning segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning, including generalization to new object instances, semantic categories, and unseen backgrounds. First, we introduce a series of foundation models to ground natural language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions in a local-global perception manner. Extensive real-world experiments conducted on a Franka Emika robot arm demonstrate the effectiveness of our proposed paradigm and policy architecture. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.

Paper Structure (14 sections, 3 equations, 6 figures, 4 tables)

This paper contains 14 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: A demonstration of our task. Receiving human instruction "I want to take a shower", our model can reason out the desired object (i.e., the towel), and then precisely pick and place it near the target object (i.e., the user represented by a Lego toy).
  • Figure 2: Our model comprises four components: (1) GPT-4, which reasons about target objects based on human demands. (2) A multi-modal prompt generator, comprising object detection and tracking models, which transforms input images and target object prompts into bounding boxes. (3) The Segment Anything model, which uses bounding boxes as prompts to segment target objects and generate task-relevant masks. (4) A two-stream policy model that processes images, language-reasoning segmentation masks, and robot proprioception to predict actions.
  • Figure 3: (a): Overview of our workstation, which has a Franka robot arm, a frontal view camera, and a lateral view camera. (b): Seen and unseen objects in the experiments. (c): Three backgrounds in the training data. (d): A challenging background with complex texture for new background evaluation.
  • Figure 4: Demonstration examples of scenes with disturbances and of other manipulation skills.
  • Figure 5: Our policy model can be conditioned by assigning different values to object masks for different manipulation skills.
  • ...and 1 more figure
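
The Figure 5 caption says the policy can be conditioned by assigning different values to object masks. A minimal sketch of that idea, assuming per-object binary masks (for example, from the segmentation stage in Figure 2) are already available as NumPy arrays, might look as follows. The function name `encode_role_mask`, the role names, and the numeric values in `ROLE_VALUES` are hypothetical choices for illustration; the source does not state the actual encoding.

```python
import numpy as np

# Hypothetical value assignment: the source does not give the real numbers.
ROLE_VALUES = {"pick": 1.0, "place": 2.0}


def encode_role_mask(binary_masks, role_values=ROLE_VALUES):
    """Merge per-role binary masks of shape (H, W) into one single-channel mask.

    Pixels covered by no mask stay 0. If masks overlap, the role that comes
    later in `binary_masks` overwrites the earlier one.
    """
    if not binary_masks:
        raise ValueError("at least one mask is required")
    shapes = {mask.shape for mask in binary_masks.values()}
    if len(shapes) != 1:
        raise ValueError("all masks must share the same shape")

    encoded = np.zeros(shapes.pop(), dtype=np.float32)
    for role, mask in binary_masks.items():
        # Write this role's value wherever its binary mask is set.
        encoded[mask.astype(bool)] = role_values[role]
    return encoded


if __name__ == "__main__":
    pick = np.zeros((4, 4), dtype=bool)
    pick[:2, :2] = True      # object to grasp, top-left corner
    place = np.zeros((4, 4), dtype=bool)
    place[2:, 2:] = True     # target location, bottom-right corner
    print(encode_role_mask({"pick": pick, "place": place}))
```

Under this assumption, a different skill would reuse the same policy input format and change only the entries in `role_values`, so the mask channel carries both where the relevant objects are and what role each one plays.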