Visual Robotic Manipulation with Depth-Aware Pretraining

Wanying Wang, Jinming Li, Yichen Zhu, Zhiyuan Xu, Zhengping Che, Yaxin Peng, Chaomin Shen, Dong Liu, Feifei Feng, Jian Tang

TL;DR

Robots operate in 3D spaces, but most visual pretraining relies on 2D data, limiting 3D manipulation performance. The authors introduce Depth-aware Pretraining for Robotics (DPR), a self-supervised framework that uses public 3D data to pretrain an RGB backbone via depth-guided contrastive learning, while no depth information is needed during policy learning or inference. They also present a proprioception injection method to fuse robot state into the policy network. Empirical results on ManiSkill2 and real-robot experiments with a Franka Panda show DPR improves generalization to unseen objects and environments and integrates as a plug-in module for existing manipulation models.

Abstract

Recent work on visual representation learning has been shown to be efficient for robotic manipulation tasks. However, most existing works pretrain the visual backbone solely on 2D images or egocentric videos, ignoring the fact that robots learn to act in 3D space, which is hard to learn from 2D observations. In this paper, we examine the effectiveness of pretraining the vision backbone on publicly available large-scale 3D data to improve manipulation policy learning. Our method, namely Depth-aware Pretraining for Robotics (DPR), enables an RGB-only backbone to learn 3D scene representations through self-supervised contrastive learning, where depth information serves as auxiliary knowledge. No 3D information is necessary during manipulation policy learning and inference, making our model both efficient and effective for manipulation in 3D space. Furthermore, we introduce a new way to inject the robot's proprioception into the policy network that makes the manipulation model robust and generalizable. We demonstrate in experiments that our proposed framework improves performance on unseen objects and visual environments for various robotics tasks on both simulated and real robots.

Paper Structure (13 sections, 10 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: An overview of our depth-aware pretraining framework for robotics. We apply identical augmentations to both the input RGB image and the depth map. The two resulting cropped images are channeled through the encoder and the projector. Subsequently, the distance between the two feature maps is computed to distinguish between positive and negative pairs. The pair of depth crops is resized to match the shape of the feature map, and the depth discrepancy is considered. The final decision for positive or negative pairs is based on a combined assessment of both RGB features and depth. During the inference phase, the pretrained encoder is employed for subsequent robotic manipulation tasks.
  • Figure 2: Architecture of the proprioception injection method.
  • Figure 3: We use six contact-rich object manipulation tasks from ManiSkill2 for simulation. Top: rigid-body tasks. Bottom: soft-body tasks.
  • Figure 4: We visualize the final embedding of the depth-aware pretrained ResNet18 via Grad-CAM (Selvaraju et al., 2017). The pretrained model appears to highlight the parts of the scene that are actionable for the robot.
  • Figure 5: Impact of pretraining data size. Pretraining the visual backbone using a large-scale dataset indeed enhances performance.
  • ...and 2 more figures
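
The pair-selection step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine distance, the nearest-neighbour resize, the two thresholds (`feat_thresh`, `depth_thresh`), and the rule that a pair is positive only when both the feature distance and the depth discrepancy are small are all assumptions made here. The caption only states that the decision combines RGB features and depth. All function names are hypothetical.

```python
import math


def resize_nearest(depth, out_h, out_w):
    """Nearest-neighbour resize of a 2D depth map (list of rows)."""
    in_h, in_w = len(depth), len(depth[0])
    return [
        [depth[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-8)


def depth_aware_pair_mask(feat_a, feat_b, depth_a, depth_b,
                          feat_thresh=0.5, depth_thresh=0.1):
    """Label every (location in crop A, location in crop B) pair.

    feat_a, feat_b: H x W grids of feature vectors from the two crops.
    depth_a, depth_b: depth crops of any size; resized to H x W first.
    Returns a (H*W) x (H*W) nested list of booleans, True = positive pair.
    Assumption: positive means close in feature space AND similar depth.
    """
    h, w = len(feat_a), len(feat_a[0])
    # Resize depth crops to the feature-map shape, then flatten everything.
    da = [d for row in resize_nearest(depth_a, h, w) for d in row]
    db = [d for row in resize_nearest(depth_b, h, w) for d in row]
    fa = [f for row in feat_a for f in row]
    fb = [f for row in feat_b for f in row]
    return [
        [
            cosine_distance(fa[p], fb[q]) < feat_thresh
            and abs(da[p] - db[q]) < depth_thresh
            for q in range(len(fb))
        ]
        for p in range(len(fa))
    ]
```

Under these assumptions, two locations with near-identical features but very different depths are treated as a negative pair, which is one way depth could act as auxiliary knowledge while the encoder itself still sees only RGB input.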