Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
Sizhe Yang, Linning Xu, Hao Li, Juncheng Mu, Jia Zeng, Dahua Lin, Jiangmiao Pang
TL;DR
Robo3R tackles the problem of obtaining reliable metric-scale 3D geometry for robotic manipulation directly from RGB images, using a feed-forward reconstruction model that fuses image data with robot state. It predicts a scale-invariant local 3D representation and a global similarity transformation that maps it to metric-scale geometry in the canonical robot frame, with a PnP-based module refining the camera extrinsics and a masked point head producing sharp geometry. Trained on Robo3R-4M, a large-scale synthetic dataset, Robo3R outperforms state-of-the-art feed-forward methods and depth sensors in 3D reconstruction quality and across a suite of downstream tasks, including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, particularly on scenes with challenging materials and small objects. The approach reduces dependence on depth sensors and camera calibration, enabling robust, manipulation-ready perception in real time.
Abstract
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
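To make the role of the global similarity transformation concrete, the following minimal sketch shows what applying such a transform to scale-invariant points means: each point p is mapped to s * R p + t. It is an illustration under stated assumptions, not the Robo3R implementation. The function names `apply_similarity` and `rotation_z`, and the scale, rotation, and translation values, are hypothetical; in Robo3R the transformation is predicted by the model rather than set by hand.

```python
import math


def apply_similarity(points, scale, rotation, translation):
    """Map each 3D point p to scale * (rotation @ p) + translation.

    points:      iterable of (x, y, z) tuples in the local, scale-invariant frame
    scale:       positive scalar s (hypothetical value, stands in for a prediction)
    rotation:    3x3 rotation matrix R as nested lists
    translation: (tx, ty, tz) offset t in the target (robot) frame
    """
    mapped = []
    for p in points:
        # Rotate: matrix-vector product R @ p.
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        # Scale, then translate.
        mapped.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return mapped


def rotation_z(angle_rad):
    """Rotation matrix for a counter-clockwise rotation about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Hypothetical local points (unit axes) and a hand-picked transform.
local_points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
robot_points = apply_similarity(
    local_points,
    scale=0.5,
    rotation=rotation_z(math.pi / 2),
    translation=(0.1, 0.2, 0.3),
)

for local, robot in zip(local_points, robot_points):
    print(local, "->", tuple(round(v, 3) for v in robot))
```

A similarity transform multiplies every pairwise distance by the same factor s and preserves angles, which is why a single global scale, rotation, and translation suffices to place scale-invariant geometry in a metric frame without distorting its shape.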
