PE3R: Perception-Efficient 3D Reconstruction

Jie Hu, Shizun Wang, Xinchao Wang

TL;DR

PE3R tackles accurate and fast 3D semantic reconstruction from 2D images alone, without 3D inputs such as camera parameters or depth and without scene-specific training. It introduces a feed-forward framework with three modules (pixel embedding disambiguation, semantic field reconstruction, and global view perception) that generalizes zero-shot across diverse scenes. Experiments report at least a 9x speedup in 3D semantic field reconstruction, along with improved segmentation and depth accuracy on open-vocabulary segmentation and multi-view depth tasks. The work suggests practical impact for real-time 3D scene understanding in robotics, AR/VR, and autonomous systems, while acknowledging ethical considerations in broad deployment.

Abstract

Recent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field. The code is publicly available at: https://github.com/hujiecpp/PE3R.

Paper Structure

This paper contains 12 sections, 2 theorems, 16 equations, 6 figures, 6 tables.

Key Result

Proposition 3.1

Vector Normalization: For any unit vectors $\mathbf{F}_A$ and $\mathbf{F}_B$, $\hat{\mathbf{F}}_B$ remains a unit vector, ensuring it lies within the same semantic space as $\mathbf{F}_A$ and $\mathbf{F}_B$.
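The proposition can be checked numerically. The listing above does not define $\hat{\mathbf{F}}_B$, so the following is only a minimal sketch under an assumption: that $\hat{\mathbf{F}}_B$ is obtained by taking a weighted combination of $\mathbf{F}_A$ and $\mathbf{F}_B$ and L2-normalizing the result. The function names and the weights `w_a` and `w_b` (standing in for something like mask areas) are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale v to unit L2 norm; reject vectors that are numerically zero."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def aggregate(f_a, f_b, w_a, w_b):
    """Hypothetical aggregation: weighted sum of two embeddings, then L2 normalization."""
    mixed = [w_a * a + w_b * b for a, b in zip(f_a, f_b)]
    return l2_normalize(mixed)


# Two unit embeddings (toy 3-dimensional values, not real CLIP features).
f_a = l2_normalize([1.0, 2.0, 2.0])
f_b = l2_normalize([0.0, 3.0, 4.0])

# Assumed weights, e.g. pixel areas of two masks.
f_b_hat = aggregate(f_a, f_b, w_a=120.0, w_b=40.0)
f_b_hat_norm = math.sqrt(sum(x * x for x in f_b_hat))
print(f_b_hat_norm)
```

Under this assumption the unit-norm property comes entirely from the final normalization step, so it holds for any weights as long as the weighted sum is nonzero. The one degenerate case is two exactly opposite vectors with equal weights, which the sketch rejects with a `ValueError`.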

Figures (6)

  • Figure 1: Visualizations for Perception-Efficient 3D Reconstruction. PE3R reconstructs 3D scenes using only 2D images and enables semantic understanding through language. The framework achieves efficiency in two key aspects. First, input efficiency allows it to operate solely with 2D images, eliminating the need for additional 3D data such as camera parameters or depth information. Second, time efficiency ensures significantly faster 3D semantic reconstruction compared to previous methods. These capabilities make PE3R highly suitable for scenarios where obtaining 3D data is challenging and for applications requiring large-scale or real-time processing.
  • Figure 2: PE3R Framework. In pixel embedding disambiguation, a foundational segmentation model (e.g., SAM) segments the input image into multi-level masks. A tracking model (e.g., SAM2) then assigns consistent labels to these masks across different views. The image regions filtered by these masks are encoded using an image encoder (e.g., CLIP), aggregated through area-moving, and mapped back to generate pixel embeddings. For semantic field reconstruction, a feed-forward model (e.g., DUSt3R) predicts pointmaps. These pointmaps are combined with pixel embeddings through semantic-guided refinement to produce a refined 3D semantic field. In global view perception, text embeddings generated by a text encoder (e.g., CLIP) are matched with 3D point embeddings to locate semantic targets via global similarity normalization.
  • Figure 3: Ablation Studies on Multi-Level Disambiguation. Without the use of multi-level disambiguation, the model is able to identify parts of objects but faces challenges in accurately localizing the semantics of entire objects.
  • Figure 4: Ablation Studies on Cross-View Disambiguation. Without cross-view disambiguation, semantic inconsistencies arise due to the challenges posed by varying viewing angles and occlusions.
  • Figure 5: Ablation Studies for PE3R with or without Global Min-Max Normalization. The absence of global min-max normalization leads to noise in the results.
  • ...and 1 more figure
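The global min-max normalization mentioned in the captions for Figures 2 and 5 can be illustrated with a small sketch. It assumes that text-to-point similarity is a dot product between unit embeddings, and that "global" means the minimum and maximum are taken over the points of all views together rather than per view. The function names and toy embeddings are hypothetical, not the authors' code.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_max(values, lo, hi):
    """Rescale values to [0, 1] given a min and max; constant input maps to 0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(s - lo) / span for s in values]


def normalize_globally(sims_per_view):
    """Use one shared min/max across every view (the assumed 'global' variant)."""
    flat = [s for view in sims_per_view for s in view]
    lo, hi = min(flat), max(flat)
    return [min_max(view, lo, hi) for view in sims_per_view]


def normalize_per_view(sims_per_view):
    """Contrast case: each view is rescaled with its own min/max."""
    return [min_max(view, min(view), max(view)) for view in sims_per_view]


# Toy unit embeddings: one text query and two views with two points each.
text = [1.0, 0.0]
views = [
    [[1.0, 0.0], [0.0, 1.0]],  # view 0 contains a strong match
    [[0.6, 0.8], [0.0, 1.0]],  # view 1 contains only a weak match
]
sims = [[dot(text, p) for p in view] for view in views]

global_scores = normalize_globally(sims)
per_view_scores = normalize_per_view(sims)
print(global_scores)
print(per_view_scores)
```

In this toy case, per-view rescaling raises the weak match in view 1 to the maximum score, while the shared range keeps it below the strong match in view 0. That is one plausible reading of why dropping the global normalization produces noise in Figure 5.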

Theorems & Definitions (4)

  • Proposition 3.1
  • proof
  • Proposition 3.2
  • proof