Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning

Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, Animesh Garg

TL;DR

Adapt3R introduces a general-purpose 3D observation encoder designed for cross-embodiment and zero-shot viewpoint transfer in imitation learning (IL). By offloading semantic reasoning to a pretrained 2D backbone and using 3D information solely to localize these semantics with respect to the end-effector, Adapt3R produces a compact conditioning vector that can be used with diverse IL algorithms and trained end-to-end. Across 93 simulated and 6 real tasks, it achieves strong multitask performance and robust zero-shot transfer to unseen embodiments and camera poses, with notable real-world gains (e.g., a 43.8% improvement over baselines in real experiments). The work highlights the practical potential of combining 2D semantic features with 3D localization for scalable, generalizable robotic manipulation, while acknowledging limitations related to depth calibration, scene geometry, and cross-embodiment scope.

Abstract

Imitation Learning (IL) can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle when observations fall outside the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address these challenges, we propose Adapt3R, a general-purpose 3D observation encoder that synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms' learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.
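The key idea can be illustrated with a minimal sketch, assuming a pinhole camera model, a known camera-to-world transform, and a translation-only offset to the end-effector. Everything named here (`backproject`, `transform`, `to_end_effector_frame`, `attention_pool`, the intrinsics, and the toy features) is hypothetical and not taken from the paper's implementation. In particular, a fixed query vector stands in for whatever learned pooling the authors use.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def transform(point, rotation, translation):
    """Apply a rigid transform: 3x3 rotation (nested lists), then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def to_end_effector_frame(point_world, ee_position):
    """Express a world-frame point relative to the end-effector position.
    Translation only: ignoring end-effector orientation is an assumption of this sketch."""
    return tuple(p - e for p, e in zip(point_world, ee_position))

def attention_pool(features, query):
    """Softmax-weighted average of per-point feature vectors for a single query."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, features)) / total
        for d in range(dim)
    ]

# Toy inputs (all values are made up for illustration).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_translation = (0.0, 0.0, 1.0)          # camera origin in the world frame
ee_position = (0.1, 0.0, 1.5)              # end-effector position in the world frame
pixels = [(320, 240, 1.0), (420, 240, 2.0)]  # (u, v, depth in meters)
semantic = [[1.0, 0.0], [0.0, 1.0]]        # stand-in for 2D backbone features

points = []
for u, v, d in pixels:
    p_cam = backproject(u, v, d, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    p_world = transform(p_cam, identity, cam_translation)
    points.append(to_end_effector_frame(p_world, ee_position))

# Each token pairs a semantic feature with its end-effector-relative position.
tokens = [feat + list(pos) for feat, pos in zip(semantic, points)]
z = attention_pool(tokens, query=[0.0] * 5)  # zero query gives uniform weights
```

Because the positions are expressed relative to the end-effector rather than the camera, moving the camera changes which pixels observe a point but not the coordinates attached to its features, which is the intuition behind the viewpoint robustness the abstract describes.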

Paper Structure

This paper contains 36 sections, 3 equations, 12 figures, 18 tables.

Figures (12)

  • Figure 1: Adapt3R extracts scene representations from RGBD inputs for use with a variety of imitation learning algorithms. It lifts pre-trained foundation model features into a point cloud, carefully processes that point cloud, and uses attention pooling to compress it into a single vector $z$ to be used as conditioning for end-to-end learning.
  • Figure 2: We train on the Franka Panda and the viewpoint shown in (a). Then, we evaluate zero-shot with the UR5e, Kinova3, and IIWA embodiments (b) and with unseen camera poses (c).
  • Figure 3: Unseen Camera Pose. We rotate the scene camera by $\theta$ radians about the vertical axis through the end-effector starting position. LIBERO-90 results use BAKU and MimicGen (MG) results use DP.
  • Figure 4: Cross Embodiment. We evaluate zero-shot with three unseen embodiments. LIBERO-90 results aggregate across all action decoders and MimicGen (MG) results use DP. Adapt3R and 3DDA consistently outperform comparisons, indicating that semantic-aligned point clouds are conducive to embodiment transfer.
  • Figure 5: Real-Robot Setup. (a) Illustration of Hardware. (b) The viewpoint used to train all policies. (c) The viewpoint used for our zero-shot evaluation experiments.
  • ...and 7 more figures
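
The camera perturbation described in the Figure 3 caption (rotating the scene camera by $\theta$ radians about the vertical axis through the end-effector starting position) can be sketched as follows. This is an illustrative sketch only: the function name `rotate_camera_about_vertical` is hypothetical, the world frame is assumed to have z pointing up, and only the camera position is updated (the camera orientation would need the same yaw applied to keep the scene in view).

```python
import math

def rotate_camera_about_vertical(camera_pos, pivot, theta):
    """Rotate a camera position by theta radians about the vertical (z) axis
    passing through `pivot`. Assumes z is the up axis of the world frame."""
    dx = camera_pos[0] - pivot[0]
    dy = camera_pos[1] - pivot[1]
    c, s = math.cos(theta), math.sin(theta)
    # Standard 2D rotation of the horizontal offset; height is unchanged.
    return (
        pivot[0] + c * dx - s * dy,
        pivot[1] + s * dx + c * dy,
        camera_pos[2],
    )

# Toy values: a camera 1 m in front of the pivot, rotated by a quarter turn.
new_pos = rotate_camera_about_vertical(
    camera_pos=(1.0, 0.0, 0.8),
    pivot=(0.0, 0.0, 0.0),
    theta=math.pi / 2,
)
```

The rotation preserves both the camera height and its horizontal distance to the pivot, so larger values of $\theta$ change only the viewing direction, not how far the camera sits from the workspace.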