BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Zhizhong Su

TL;DR

BIP3D introduces an image-centric approach to 3D perception for embodied intelligence, bridging 2D image features with 3D understanding by integrating explicit 3D position encoding, camera modeling, and depth-aware fusion. Built atop GroundingDINO, it performs multi-view, multi-modal 3D detection and grounding from RGB inputs (depth is optional) and outperforms prior state-of-the-art methods on EmbodiedScan, improving AP3D@0.25 by 5.69% for detection and achieving a 15.25% gain in 3D grounding on validation. Key innovations include the Feature Enhancer, the Spatial Enhancer, and a 3D multi-view fusion decoder, along with camera intrinsic standardization and a training objective that combines DETR-style losses with 3D Wasserstein-based box regression. The work demonstrates that 2D foundation models, when coupled with explicit 3D geometry and multi-view fusion, can surpass point-centric methods and enable scalable, RGB-only data collection for embodied AI applications, with potential extensions to dynamic scenes and higher-level tasks.

Abstract

In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.

Paper Structure

This paper contains 25 sections, 19 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Comparison of Point-centric and Image-centric Model Architectures. Dashed boxes denote optional pluggable modules. (a) The point-centric model centers its parameters within the 3D encoder, utilizing feature representations like point or 3D voxel features. (b) By contrast, the image-centric model emphasizes the 2D encoder, using 2D feature maps for its representations.
  • Figure 2: The Architecture Diagram of BIP3D, where $\textcolor{red}{\star}$ indicates the parts that have been modified or added compared to the base model, GroundingDINO, and dashed lines indicate optional elements.
  • Figure A.1: Latency Comparison, where '*' indicates the inclusion of point cloud preprocessing time, encompassing multi-view aggregation and down-sampling.
  • Figure A.2: Images Comparison Before and After Camera Intrinsic Standardization. Left: Original, Right: Standardized.
  • Figure A.3: The 3D Bounding Box Corner Permutations. For the same bounding box, there are a total of 48 different corner point permutations; the corner point order is indicated by numbers, with red, yellow, and green representing width, length, and height, respectively.
  • ...and 3 more figures
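
The count of 48 in the Figure A.3 caption can be checked with a short enumeration. The sketch below is an illustration only, not the paper's code: it assumes the 48 orderings arise from relabelling the three box axes (3! = 6 ways) combined with flipping the direction of each axis (2^3 = 8 ways). The names `CORNERS` and `corner_orderings` are hypothetical.

```python
from itertools import permutations, product

import numpy as np

# Corners of a unit box centred at the origin, one row per corner.
CORNERS = np.array(list(product((-0.5, 0.5), repeat=3)))


def corner_orderings():
    """Enumerate corner orderings induced by relabelling and flipping box axes.

    Assumption: the 48 permutations in the caption are the 3! * 2^3 signed
    axis permutations of a box; the paper's own construction may differ.
    """
    orderings = set()
    for axes in permutations(range(3)):            # 3! = 6 axis relabellings
        for signs in product((-1, 1), repeat=3):   # 2^3 = 8 direction flips
            transformed = CORNERS[:, list(axes)] * np.array(signs)
            # Index of each transformed corner within the original corner list.
            order = tuple(
                int(np.argmin(np.abs(CORNERS - corner).sum(axis=1)))
                for corner in transformed
            )
            orderings.add(order)
    return orderings


ORDERINGS = corner_orderings()
print(len(ORDERINGS))  # 48
```

Each ordering is a permutation of the eight corner indices, and all 48 describe the same physical box, which is why a corner-based comparison between two boxes has to account for them.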