EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video

Zhen Zhou, Yunkai Ma, Junfeng Fan, Shaolin Zhang, Fengshui Jing, Min Tan

TL;DR

EPRecon is an efficient real-time framework for panoptic 3D reconstruction from monocular video. It introduces a lightweight depth-prior module that estimates voxel occupancy probabilities directly in the 3D volume, reducing the number of non-surface voxels before reconstruction. It then performs depth-guided surface reconstruction and fuses voxel and image features with deformable cross-attention to produce detailed, instance-level panoptic segmentation. On ScanNetV2, EPRecon achieves state-of-the-art panoptic 3D reconstruction quality at real-time inference speed, running significantly faster than baselines based on depth-map fusion while maintaining or improving accuracy. The approach shows that real-time, densely annotated 3D scene understanding from monocular input is practical, with potential benefits for robotics and AR applications.

Abstract

Panoptic 3D reconstruction from a monocular video is a fundamental perceptual task in robotic scene understanding. However, existing methods fall short in both inference speed and accuracy, which limits their practical applicability. We present EPRecon, an efficient real-time panoptic 3D reconstruction framework. Current volumetric reconstruction methods usually rely on multi-view depth map fusion to obtain scene depth priors, which is time-consuming and hinders real-time scene reconstruction. To address this issue, we propose a lightweight module that estimates scene depth priors directly in a 3D volume by generating occupancy probabilities for all voxels, which improves reconstruction quality. In addition, unlike existing panoptic segmentation methods, EPRecon extracts panoptic features from both voxel features and the corresponding image features, obtaining more detailed and comprehensive instance-level semantic information and achieving more accurate segmentation results. Experimental results on the ScanNetV2 dataset demonstrate the superiority of EPRecon over current state-of-the-art methods in terms of both panoptic 3D reconstruction quality and real-time inference. Code is available at https://github.com/zhen6618/EPRecon.
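
The occupancy-based depth prior described in the abstract can be illustrated with a minimal sketch. It assumes the module produces one raw score (logit) per voxel and that a fixed threshold of 0.5 separates likely-surface voxels from the rest. The function name, the threshold, and the toy data are hypothetical and are not taken from the EPRecon code.

```python
import math

def prune_voxels_by_occupancy(logits, coords, threshold=0.5):
    """Keep voxels whose predicted occupancy probability exceeds `threshold`.

    logits: list of raw per-voxel scores (hypothetical network output).
    coords: list of (i, j, k) integer voxel indices, same length as logits.
    Returns the surviving coordinates and their occupancy probabilities.
    """
    kept_coords, kept_probs = [], []
    for logit, coord in zip(logits, coords):
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps a logit to (0, 1)
        if prob > threshold:
            kept_coords.append(coord)
            kept_probs.append(prob)
    return kept_coords, kept_probs

# Toy example: four voxels, two of which are predicted to lie near a surface.
logits = [-3.0, 0.2, 2.5, -0.1]
coords = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
kept_coords, kept_probs = prune_voxels_by_occupancy(logits, coords)
```

Only the surviving voxels would be passed on to later stages in such a design, which is one way a volumetric prior could reduce the amount of work compared with fusing multi-view depth maps.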

Paper Structure (20 sections, 6 equations, 3 figures, 5 tables)

This paper contains 20 sections, 6 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Architecture of the proposed EPRecon. EPRecon directly estimates scene depth priors in a 3D volume and then performs depth-guided panoptic reconstruction. To obtain depth information, we predict the surface occupancy probabilities and TSDF values of voxels. Panoptic features are extracted from both voxel features and the corresponding image features, yielding more detailed and comprehensive instance-level semantic information. EPRecon performs panoptic reconstruction within each fragment bounding volume (FBV) and gradually recovers the entire scene in chronological order (see the rightmost column for visualization results).
  • Figure 2: Illustration of the proposed depth prior estimation module.
  • Figure 3: Ablation study results of panoptic 3D reconstruction. Guided by the proposed depth prior estimation module, EPRecon recovers more complete and accurate panoptic reconstructions. Extracting panoptic features from voxel features alone captures too little of the comprehensive instance-level semantic information, while relying on image features alone captures too little fine detail.
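
Combining voxel features with image features, as described for Figure 1, requires knowing where each voxel falls in an image. The sketch below shows only that geometric correspondence using a standard pinhole camera model. It does not implement the deformable cross-attention used by EPRecon, and the function name, pose layout, and intrinsics are hypothetical values chosen for illustration.

```python
def project_voxel_to_pixel(center_world, world_to_cam, fx, fy, cx, cy):
    """Project a voxel center (world coordinates) to pixel coordinates.

    world_to_cam: 3x4 matrix [R | t] given as three rows of four floats.
    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point).
    Returns (u, v), or None if the point is not in front of the camera.
    """
    x, y, z = center_world
    # Rigid transform from world coordinates into the camera frame.
    xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in world_to_cam)
    if zc <= 0.0:
        return None  # behind (or on) the camera plane: no valid pixel
    # Perspective division followed by the intrinsic mapping.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return u, v

# Toy example: camera at the world origin looking along +z.
identity_pose = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
pixel = project_voxel_to_pixel((0.5, -0.25, 2.0), identity_pose,
                               fx=500.0, fy=500.0, cx=320.0, cy=240.0)
behind = project_voxel_to_pixel((0.0, 0.0, -1.0), identity_pose,
                                fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A fusion module could use pixel locations like these as starting points for sampling image features, for example as reference points around which an attention mechanism looks up nearby features.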