Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction

Sizhe Yang, Linning Xu, Hao Li, Juncheng Mu, Jia Zeng, Dahua Lin, Jiangmiao Pang

TL;DR

Robo3R tackles the problem of obtaining reliable metric-scale 3D geometry for robotic manipulation directly from RGB images, using a feed-forward reconstruction model that fuses image data with robot state. It predicts a scale-invariant local 3D representation and a global similarity transformation that together yield metric-scale geometry in the canonical robot frame; a PnP-based extrinsic estimation module refines the alignment, and a masked point head keeps the geometry sharp. Trained on Robo3R-4M, a large-scale synthetic dataset, Robo3R outperforms state-of-the-art feed-forward methods and depth sensors in 3D reconstruction quality and across a suite of downstream tasks (imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning), particularly for challenging materials and small objects. The approach reduces dependence on depth sensors and camera calibration while running in real time.
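To make the role of the global similarity transformation concrete, here is a minimal sketch of how a scale, a rotation, and a translation can map scale-invariant points into a metric robot frame. The function names and all numeric values are hypothetical illustrations; in Robo3R the transformation is predicted by the network, and this sketch only shows how such a transformation would be applied.

```python
import math


def rotation_z(theta):
    """Return a 3x3 rotation matrix about the z-axis (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_similarity(points, scale, rotation, translation):
    """Map each point p to scale * (R @ p) + t."""
    out = []
    for p in points:
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        out.append([scale * rotated[i] + translation[i] for i in range(3)])
    return out


# Hypothetical scale-invariant points (arbitrary units, arbitrary frame).
local_points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical similarity parameters: 0.5 m per unit, a 90-degree yaw,
# and an offset of the origin expressed in the robot frame.
robot_points = apply_similarity(
    local_points,
    scale=0.5,
    rotation=rotation_z(math.pi / 2),
    translation=[0.1, 0.0, 0.2],
)
```

A similarity transformation preserves shape and multiplies every distance by the same factor, which is why a single global scale is enough to turn scale-invariant geometry into metric geometry.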

Abstract

3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
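For readers unfamiliar with Perspective-n-Point, the standard formulation can be written as a reprojection-error minimization. The notation below is assumed for illustration and is not taken from the paper: $X_i$ are 3D keypoint positions in the robot frame, $x_i$ are their detected 2D image locations, $K$ is the camera intrinsic matrix, and $(R, t)$ are the camera extrinsics being estimated.

```latex
\[
(R^{*}, t^{*}) \;=\; \arg\min_{R \in SO(3),\; t \in \mathbb{R}^{3}}
\sum_{i=1}^{n} \left\lVert x_i - \pi\!\left( K \left( R X_i + t \right) \right) \right\rVert_2^{2},
\qquad
\pi\!\left( [a,\, b,\, c]^{\top} \right) = \left[ \tfrac{a}{c},\; \tfrac{b}{c} \right]^{\top}
\]
```

Because the keypoints lie on the robot, their 3D positions are known in the robot frame, so the recovered extrinsics relate the camera directly to that frame. How Robo3R obtains the keypoints and uses the result to refine the global similarity transformation is specified in the paper, not in this sketch.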

Paper Structure

This paper contains 34 sections, 14 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Overview. Robo3R enables manipulation-ready 3D reconstruction from RGB frames in real time. By achieving accurate metric-scale 3D geometry in the canonical robot frame, Robo3R eliminates the need for depth sensors and calibration, while improving accuracy and robustness in challenging manipulation scenarios. These features lead to notable improvements in downstream applications such as imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.
  • Figure 2: Method Overview. RGB images and robot states are encoded and fused. The transformer backbone processes the resulting features through alternating global and frame-wise attention. The masked point head decodes scale-invariant local geometry, while the relative pose head outputs relative poses for registering points across multiple views. S.T. tokens read out the global similarity transformation, which maps the points into metric-scale 3D geometry in the canonical robot frame.
  • Figure 3: Masked point head. To address the over-smoothing problem for dense prediction, we propose a masked point head that decomposes point prediction into depth, normalized image coordinate, and mask predictions. Through unprojection, masking, and combination, we obtain sharp points with fine-grained geometric details.
  • Figure 4: Extrinsic estimation module. The extrinsic estimation module extracts robot keypoints and accurately estimates the camera extrinsics by solving the Perspective-n-Point (PnP) problem; the camera extrinsics are used to refine the global similarity transformation.
  • Figure 5: Data samples. The dataset showcases a diverse array of assets with extensive randomization, encompassing rich modalities and comprehensive annotations.
  • ...and 6 more figures
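The decomposition described in the Figure 3 caption (depth, normalized image coordinates, and a mask, followed by unprojection and masking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the 0.5 threshold, the reading of the mask as a per-pixel validity score, and all sample values are hypothetical.

```python
def unproject_masked(depth, norm_xy, mask, threshold=0.5):
    """Unproject per-pixel predictions to 3D points, dropping masked-out pixels.

    depth:   per-pixel depth values z
    norm_xy: per-pixel normalized image coordinates (x_n, y_n), i.e. the
             ray direction is (x_n, y_n, 1)
    mask:    per-pixel validity scores in [0, 1] (assumed semantics)
    """
    points = []
    for z, (x_n, y_n), m in zip(depth, norm_xy, mask):
        if m < threshold:
            # Assumed behavior: low-score pixels are discarded instead of
            # being blended into a smoothed surface.
            continue
        # Unprojection: scale the ray (x_n, y_n, 1) by the depth.
        points.append((x_n * z, y_n * z, z))
    return points


# Hypothetical predictions for three pixels.
depth = [2.0, 1.0, 4.0]
norm_xy = [(0.5, -0.25), (0.0, 0.0), (-0.1, 0.2)]
mask = [0.9, 0.1, 0.8]

points = unproject_masked(depth, norm_xy, mask)
```

In this sketch the second pixel is discarded because its mask score falls below the threshold, which illustrates how masking can remove unreliable points such as those straddling depth discontinuities.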