Online 3D Scene Reconstruction Using Neural Object Priors

Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid

TL;DR

This work tackles online reconstruction of scenes at the level of individual objects from RGB-D video by introducing an object-centric neural implicit representation driven by per-object feature grids and small MLPs. A key contribution is feature-grid interpolation, which incrementally extends object geometry as new parts appear, enabling online operation. The second major contribution is an object library of prior shapes that can be retrieved, registered, and used to initialize current-object models, with synthesized keyframes from priors to prevent forgetting past details. Experiments on Replica, ScanNet, and lab-recorded sequences show that object priors improve reconstruction accuracy and completeness, outperforming several state-of-the-art NeRF-based and TSDF baselines, while maintaining online efficiency. The approach enables more faithful, complete, and reusable object reconstructions in dynamic scenes, with practical implications for AR, robotics, and virtual reality.

Abstract

This paper addresses the problem of reconstructing a scene online at the level of objects given an RGB-D video sequence. While current object-aware neural implicit representations hold promise, they are limited in online reconstruction efficiency and shape completion. Our main contributions to alleviate the above limitations are twofold. First, we propose a feature grid interpolation mechanism to continuously update grid-based object-centric neural implicit representations as new object parts are revealed. Second, we construct an object library with previously mapped objects in advance and leverage the corresponding shape priors to initialize geometric object models in new videos, subsequently completing them with novel views as well as synthesized past views to avoid losing original object details. Extensive experiments on synthetic environments from the Replica dataset, real-world ScanNet sequences and videos captured in our laboratory demonstrate that our approach outperforms state-of-the-art neural implicit models for this task in terms of reconstruction accuracy and completeness.
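
The feature grid interpolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes axis-aligned bounding boxes expressed in a shared frame, scalar features per grid vertex, and plain trilinear interpolation; none of these details are given in the text above, and the names `make_grid`, `trilinear`, and `resample_grid` are hypothetical. When an object's bounding box grows, each vertex of the new grid copies an interpolated feature from the old grid if it falls inside the old box, and otherwise receives a default value.

```python
from itertools import product


def make_grid(res, fill=0.0):
    """Create a res x res x res grid of scalar features as nested lists."""
    return [[[fill for _ in range(res)] for _ in range(res)] for _ in range(res)]


def trilinear(grid, box_min, box_max, point):
    """Trilinearly interpolate `grid` (spanning box_min..box_max) at `point`.

    Returns None when the point lies outside the box. Requires res >= 2.
    """
    res = len(grid)
    idx = []
    for p, lo, hi in zip(point, box_min, box_max):
        if p < lo or p > hi:
            return None
        idx.append((p - lo) / (hi - lo) * (res - 1))  # continuous vertex index
    base = [min(int(u), res - 2) for u in idx]  # lower corner of the enclosing cell
    frac = [u - b for u, b in zip(idx, base)]
    value = 0.0
    for dx, dy, dz in product((0, 1), repeat=3):
        weight = 1.0
        for d, f in zip((dx, dy, dz), frac):
            weight *= f if d else 1.0 - f
        value += weight * grid[base[0] + dx][base[1] + dy][base[2] + dz]
    return value


def resample_grid(old_grid, old_min, old_max, new_min, new_max, fill=0.0):
    """Build a grid over the new box, reusing old features where the boxes overlap."""
    res = len(old_grid)
    new_grid = make_grid(res, fill)
    for i, j, k in product(range(res), repeat=3):
        # Position of the new grid vertex in the shared frame (assumption).
        point = [lo + n / (res - 1) * (hi - lo)
                 for n, lo, hi in zip((i, j, k), new_min, new_max)]
        value = trilinear(old_grid, old_min, old_max, point)
        if value is not None:
            new_grid[i][j][k] = value  # inherit the previously learned feature
    return new_grid
```

In this sketch, vertices outside the old box keep the `fill` value and would be learned from new observations; a real system would store feature vectors rather than scalars and would likely run this on a GPU.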

Paper Structure

This paper contains 54 sections, 4 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Our method reconstructs scenes at the level of objects from RGB-D videos on the fly. We leverage 3D shape priors from a pre-computed object library to enhance accuracy and completeness of geometry reconstruction for individual objects.
  • Figure 2: (Left) Our object-centric representation. Given a 3D point $x$ inside the object bounding box, we predict its occupancy and color values via two small feature grids and MLPs. The object model is trained by volume rendering. (Right) Given the mapping between object bounding boxes at times $t-1$ and $t$, we retrieve features in the former feature grid to update the new one.
  • Figure 3: Overview of the procedure to integrate prior object models. (a) Retrieval: given a newly segmented object, we retrieve the most similar object in the object library via CLIP embedding. (b) Registration: we get an aligned pose of the retrieved object via point cloud registration. (c) Shape refinement: we refine the initial shape model with novel views while additionally synthesizing keyframes from retrieved object models to not lose shape details.
  • Figure 4: Examples of reconstructions with our method on different Replica scenes, compared to vMAP (Kong et al., 2023). Our method recovers object geometry that is more faithful to the actual shapes and with better texture.
  • Figure 5: Reconstruction of a ScanNet sequence with vMAP and our method, with close-up views of some parts. Our method recovers more accurate geometries than vMAP, which over-smooths surfaces (see in particular the piano and plant on the right or the sofa on the left), though it is a bit more sensitive to ScanNet's noisy inputs.
  • ...and 12 more figures
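
The retrieval step in the Figure 3 caption (finding the most similar library object via CLIP embedding) can be sketched as a nearest-neighbour search under cosine similarity. This is a minimal illustration, not the paper's implementation: the embeddings here are toy vectors standing in for CLIP features, the choice of cosine similarity is an assumption, and `retrieve_prior` and its `min_similarity` threshold are hypothetical additions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def retrieve_prior(query_embedding, library, min_similarity=0.8):
    """Return (name, score) of the most similar library entry.

    `library` maps an object name to its embedding. If the best score is below
    `min_similarity` (a hypothetical threshold), the name is None, meaning the
    object would be reconstructed without a prior.
    """
    best_name, best_score = None, -1.0
    for name, embedding in library.items():
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return None, best_score
    return best_name, best_score


# Toy usage: 3-D vectors stand in for real image embeddings.
library = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
match, score = retrieve_prior([0.9, 0.1, 0.0], library)
```

A retrieved match would then be passed to the registration step, which aligns the prior shape to the observed point cloud; that step is not sketched here.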