Learning Neural Implicit through Volume Rendering with Attentive Depth Fusion Priors

Pengchong Hu, Zhizhong Han

TL;DR

The paper addresses incomplete depth and occlusion in neural implicit 3D reconstruction from RGBD images by introducing an attentive depth fusion prior, which blends a learned occupancy with a coarse TSDF fused from all available depth images. It presents a volume rendering framework with hierarchical feature grids and a learnable attention module that decides how much to rely on the TSDF prior versus the learned geometry, and it supports both a one-time fused TSDF and an incrementally fused TSDF for SLAM. The authors report state-of-the-art surface reconstruction and camera tracking on the Replica and ScanNet benchmarks, with ablations showing the importance of depth priors, bandwidth tuning, and the attention mechanism. The method aims to improve robustness and fidelity in 3D scene reconstruction, particularly with streaming RGBD data where camera poses must also be estimated.

Abstract

Learning neural implicit representations has achieved remarkable performance in 3D reconstruction from multi-view images. Current methods use volume rendering to render implicit representations into either RGB or depth images that are supervised by multi-view ground truth. However, rendering a view each time suffers from incomplete depth at holes and unawareness of occluded structures from the depth supervision, which severely affects the accuracy of geometry inference via volume rendering. To resolve this issue, we propose to learn neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior. Our prior allows neural networks to perceive coarse 3D structures from the Truncated Signed Distance Function (TSDF) fused from all depth images available for rendering. The TSDF enables accessing the missing depth at holes on one depth image and the occluded parts that are invisible from the current view. By introducing a novel attention mechanism, we allow neural networks to directly use the depth fusion prior with the inferred occupancy as the learned implicit function. Our attention mechanism works with either a one-time fused TSDF that represents a whole scene or an incrementally fused TSDF that represents a partial scene in the context of Simultaneous Localization and Mapping (SLAM). Our evaluations on widely used benchmarks including synthetic and real-world scans show our superiority over the latest neural implicit methods. Project page: https://machineperceptionlab.github.io/Attentive_DF_Prior/
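
The two ingredients named in the abstract, a TSDF fused from depth observations and an attention step that mixes that prior with a learned occupancy, can be illustrated with a small sketch. This is not the authors' implementation. The fusion step uses the standard running weighted average for TSDF voxels; the TSDF-to-occupancy mapping, the `sharpness` parameter, and the two-way softmax blend are assumptions introduced here for illustration, and the attention logits, which a trained network would predict, are passed in as plain numbers.

```python
import math


def fuse_tsdf(tsdf, weight, sdf_obs, trunc, w_obs=1.0):
    """Update one voxel with a new signed-distance observation.

    Standard depth fusion: truncate the observation, normalize it to
    [-1, 1], and fold it into a running weighted average.
    """
    d = max(-1.0, min(1.0, sdf_obs / trunc))
    new_weight = weight + w_obs
    new_tsdf = (tsdf * weight + d * w_obs) / new_weight
    return new_tsdf, new_weight


def tsdf_to_occupancy(tsdf, sharpness=10.0):
    """Hypothetical mapping from a normalized TSDF value to occupancy.

    Assumes positive values lie in free space and negative values lie
    behind the surface, so occupancy is 0.5 exactly at the surface.
    """
    return 1.0 / (1.0 + math.exp(sharpness * tsdf))


def attentive_blend(occ_learned, occ_prior, logit_learned, logit_prior):
    """Blend two occupancy estimates with softmax attention weights.

    The logits are stand-ins for the output of a learned attention
    module (an assumption of this sketch).
    """
    m = max(logit_learned, logit_prior)  # subtract max for stability
    e_l = math.exp(logit_learned - m)
    e_p = math.exp(logit_prior - m)
    a_l = e_l / (e_l + e_p)
    a_p = 1.0 - a_l
    return a_l * occ_learned + a_p * occ_prior


# Fuse two depth observations into one voxel (truncation distance 0.1).
tsdf, w = fuse_tsdf(0.0, 0.0, sdf_obs=0.05, trunc=0.1)
tsdf, w = fuse_tsdf(tsdf, w, sdf_obs=0.15, trunc=0.1)

# Mix the prior with a learned occupancy using equal attention logits.
occ = attentive_blend(0.2, tsdf_to_occupancy(tsdf), 0.0, 0.0)
```

Because the softmax weights sum to one, the blended value always lies between the two inputs, so the prior can fill in geometry where the learned occupancy is poorly supervised without pushing the result outside the valid range.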

Paper Structure

This paper contains 15 sections, 6 equations, 18 figures, and 18 tables.

Figures (18)

  • Figure 1: Overview of our method.
  • Figure 2: Merits of attentive depth fusion prior.
  • Figure 3: Visualization of attention on depth fusion.
  • Figure 4: Visual comparisons in camera tracking.
  • Figure 5: Visual comparison of error maps (red indicates large error).
  • ...and 13 more figures