3D-MVP: 3D Multiview Pretraining for Robotic Manipulation

Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, Ankit Goyal

TL;DR

3D-MVP tackles the scarcity of diverse robotics data by pretraining the visual encoder of the Robotic View Transformer (RVT) on multi-view 3D scenes using masked autoencoding, leveraging large-scale datasets like Objaverse. The encoder is trained to reconstruct five orthogonal RGB-D views from masked tokens, encouraging 3D-aware representations, and is then fine-tuned with RVT's action decoder on downstream manipulation tasks. Empirical results on RLBench and COLOSSEUM show that 3D-MVP outperforms 2D pretraining and scratch baselines, with notable gains in medium-difficulty tasks and robustness to environmental variations. This approach demonstrates the value of 3D-aware pretraining for improving generalization and sample efficiency in vision-based robotic manipulation, with implications for scalable 3D scene understanding in robotics.

Abstract

Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. We leverage the Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT's multi-view transformer into a visual encoder and an action decoder, and pretrain the visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. Our results suggest that 3D-aware pretraining is a promising approach to improving the generalization of vision-based robotic manipulation policies. Project site: https://jasonqsy.github.io/3DMVP
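
The masked-autoencoding objective described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the image size, patch size, masking ratio, and the choice to pool tokens from all views before masking are assumptions made for illustration, and the decoder output is replaced by an all-zeros placeholder. Only the five views and the RGB-D (four-channel) input come from the text.

```python
import numpy as np

NUM_VIEWS = 5      # five orthogonal views (from the text)
CHANNELS = 4       # RGB + depth (from the text)
IMG = 32           # assumed image side length, kept small for the example
PATCH = 8          # assumed patch side length
MASK_RATIO = 0.75  # assumed; the source does not state the ratio


def patchify(views, patch):
    """Turn (V, H, W, C) views into a (V * num_patches, patch*patch*C) token matrix."""
    v, h, w, c = views.shape
    gh, gw = h // patch, w // patch
    x = views.reshape(v, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # -> (V, gh, gw, patch, patch, C)
    return x.reshape(v * gh * gw, patch * patch * c)


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean mask; True marks a token hidden from the encoder."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared reconstruction error over the masked tokens only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
views = rng.random((NUM_VIEWS, IMG, IMG, CHANNELS), dtype=np.float32)

tokens = patchify(views, PATCH)              # tokens pooled across all views
mask = random_mask(tokens.shape[0], MASK_RATIO, rng)
visible = tokens[~mask]                      # what an encoder would receive

pred = np.zeros_like(tokens)                 # placeholder for a decoder's output
loss = masked_mse(pred, tokens, mask)
print(tokens.shape, visible.shape, loss)
```

Because the tokens from every view share one mask in this sketch, a hidden patch in one view can in principle be inferred from visible patches in the others, which is the cross-view cue the pretraining is meant to exploit.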

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Comparison of 2D vs. 3D pretraining for manipulation. (Left): In 2D pretraining [xiao2022masked], the model is trained to do MAE reconstruction from a single image drawn from interaction videos [grauman2022ego4d, Shan20]. The encoder is then used for downstream manipulation tasks. However, because of this pretraining, the input can only be a single 2D image. (Right): We propose 3D Multi-View Pretraining (3D-MVP), which uses multiple orthogonal RGB-D views of a 3D model. The model is tasked with reconstructing the masked views by leveraging information across different perspectives, which encourages it to learn more robust 3D spatial representations. This multi-view approach improves downstream performance in robot manipulation tasks by capturing richer scene understanding than 2D-only pretraining, and it is compatible with state-of-the-art 3D manipulation methods such as RVT [goyal2023rvt], which take 3D inputs.
  • Figure 2: Overview of 3D-MVP. (a) We first pretrain a Multiview 3D Transformer (MVT) using masked autoencoding on multiview RGB-D images. (b) We then finetune the pretrained MVT on manipulation tasks. Since the MVT is pretrained, the learned manipulation policy generalizes better. For example, it is more robust to changes in texture, size, and lighting.
  • Figure 3: MAE Reconstruction results on Objaverse. Our pretrained multi-view transformer generalizes to unseen object instances and reconstructs multi-view images from their masked versions.
  • Figure 4: Results on COLOSSEUM [pumacay2024colosseum]. We report the average task completion success rate for 12 environmental perturbations and for no perturbation. Manipulation policies that do explicit 3D reasoning (RVT [goyal2023rvt]) work significantly better than 2D pretraining approaches (MVP [xiao2022masked] and R3M [nair2022r3m]). 3D-MVP is more robust than RVT on most perturbations. MO = manipulation object. RO = receiver object.
  • Figure 5: Pretraining MAE on RLBench scenes leads to poor generalization performance. (Left): MAE reconstruction results on unseen RLBench renderings. (Right): MAE reconstruction results on Objaverse renderings. While the reconstruction is reasonable on unseen RLBench renderings, the model overfits to RLBench and does not learn a general representation.