BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Zhizhong Su
TL;DR
BIP3D introduces an image-centric approach to 3D perception for embodied intelligence, bridging 2D image features and 3D understanding through explicit 3D position encoding, camera modeling, and depth-aware fusion. Built atop GroundingDINO, it performs multi-view, multi-modal 3D detection and grounding from RGB inputs (depth is optional) and outperforms prior state-of-the-art methods on EmbodiedScan, improving AP3D@0.25 by 5.69% for detection and achieving a 15.25% gain in 3D grounding on the validation set. Key innovations include the Feature Enhancer, the Spatial Enhancer, and a 3D multi-view fusion decoder, along with camera intrinsic standardization and a training objective that combines DETR-style losses with 3D Wasserstein-based box regression. The work demonstrates that 2D foundation models, when coupled with explicit 3D geometry and multi-view fusion, can surpass point-centric methods and enable scalable, RGB-only data collection for embodied AI applications, with potential extensions to dynamic scenes and higher-level tasks.
Abstract
In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms rely primarily on point clouds, which, despite offering precise geometric information, still constrain perception performance due to their inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
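As a rough illustration of what attaching explicit 3D positions to image features can involve, the sketch below lifts each pixel of a depth map to a camera-frame 3D point using the standard pinhole camera model. This is not taken from BIP3D itself: the function name backproject_pixels, the toy depth map, and the intrinsics (fx, fy, cx, cy) are assumptions chosen for illustration, and the actual spatial enhancer may compute and encode positions differently.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_pixels(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[List[Point3D]]:
    """Lift every pixel (u, v) with depth d to a camera-frame point (x, y, z).

    Uses the pinhole model: x = (u - cx) * d / fx, y = (v - cy) * d / fy, z = d.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    points: List[List[Point3D]] = []
    for v, row in enumerate(depth):          # v indexes image rows
        out_row: List[Point3D] = []
        for u, d in enumerate(row):          # u indexes image columns
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            out_row.append((x, y, d))
        points.append(out_row)
    return points


# Toy 2x2 depth map (metres) with made-up intrinsics.
depth_map = [[2.0, 2.0], [4.0, 4.0]]
pts = backproject_pixels(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(pts[0][0], pts[1][1])
```

In a model of this kind, per-pixel 3D points such as these would typically be passed through a small network to produce a position embedding that is combined with the 2D image features; when depth is unavailable, a predicted depth or depth distribution could stand in for the measured values.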
