Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction

Xiaoyang Lyu, Chirui Chang, Peng Dai, Yang-Tian Sun, Xiaojuan Qi

TL;DR

Total-Decom tackles decomposed 3D scene reconstruction from posed multi-view images with minimal human input by integrating the Segment Anything Model (SAM) with a hybrid implicit-explicit surface representation and a mesh-based region-growing procedure. The pipeline distills SAM features into an implicit reconstruction, extracts explicit meshes, and uses SAM-driven seeds plus topology-aware region growing to separate foreground objects from the background. It achieves high-quality scene and object reconstructions on Replica and ScanNet, requiring on average about one click per object, and enables downstream tasks such as re-texturing and scene editing. The approach offers a scalable, interactive solution to object-level decomposition in complex scenes, and the code is publicly available.
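To make the idea of seed-based region growing on a mesh concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes a triangle mesh given as a list of vertex-index triples, a set of seed faces (for example, faces hit by a projected 2D mask), and a per-face boolean `is_candidate` that says whether growth may pass through a face. The names `build_face_adjacency`, `grow_region`, and `is_candidate` are hypothetical, and the real growth criterion in Total-Decom may differ.

```python
from collections import defaultdict, deque


def build_face_adjacency(faces):
    """Map each face index to the set of faces that share an edge with it."""
    edge_to_faces = defaultdict(list)
    for face_idx, (a, b, c) in enumerate(faces):
        for u, v in ((a, b), (b, c), (c, a)):
            # Store edges with sorted endpoints so (u, v) and (v, u) match.
            edge_to_faces[(min(u, v), max(u, v))].append(face_idx)

    adjacency = defaultdict(set)
    for sharing_faces in edge_to_faces.values():
        for i in sharing_faces:
            for j in sharing_faces:
                if i != j:
                    adjacency[i].add(j)
    return adjacency


def grow_region(faces, seed_faces, is_candidate):
    """Breadth-first growth from seed faces across shared edges.

    Assumption: a face joins the region only if is_candidate[face] is True.
    This stands in for whatever criterion the actual method applies.
    """
    adjacency = build_face_adjacency(faces)
    region = {f for f in seed_faces if is_candidate[f]}
    queue = deque(region)
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in region and is_candidate[neighbor]:
                region.add(neighbor)
                queue.append(neighbor)
    return region


# Toy mesh: faces 0-1-2 form a strip, face 3 is a disconnected triangle.
faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (5, 6, 7)]
is_candidate = [True, True, False, True]
print(grow_region(faces, [0], is_candidate))  # faces 0 and 1
```

Because growth follows mesh connectivity rather than feature similarity in 3D space, the region stops at faces that fail the candidate test and never jumps to disconnected geometry such as face 3 in the toy mesh.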

Abstract

Scene reconstruction from multi-view images is a fundamental problem in computer vision and graphics. Recent neural implicit surface reconstruction methods have achieved high-quality results; however, editing and manipulating the 3D geometry of reconstructed scenes remains challenging due to the absence of naturally decomposed object entities and complex object/background compositions. In this paper, we present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction. Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition. Total-Decom requires minimal human annotations while providing users with real-time control over the granularity and quality of decomposition. We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing. The code is available at https://github.com/CVMI-Lab/Total-Decom.git.

Paper Structure

This paper contains 30 sections, 15 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Indoor scenes consist of complex compositions of objects and backgrounds. Our proposed method, Total-Decom, (a) performs 3D reconstruction from posed multiview images, (b) decomposes the reconstructed mesh to generate high-quality meshes for individual objects and backgrounds with minimal human annotations. This approach facilitates such applications as (c) object re-texturing and (d) scene reconfiguration. For additional demonstrations, please refer to our supplementary materials and videos.
  • Figure 2: Visualization for distilled generalized features.
  • Figure 3: Comparison of different decomposition methods using SAM features. "SAM + region growing" denotes object extraction with our method. "SAM + similarity" denotes object extraction by similarity matching in 3D space, following [tschernezki2022neural, kobayashi2022decomposing].
  • Figure 4: Visualization of the SAM features for the same object in different views with t-SNE [JMLR:v9:vandermaaten08a]. All the features are in the same feature space.
  • Figure 5: Overview of Total-Decom. (1) Foreground- and background-decomposed neural reconstruction. This stage uses four networks to predict per-point geometry, appearance, semantics, and SAM features. We follow ObjSDF++ [wu2023objectsdf++] in using a compositional foreground and background representation with pseudo geometry priors, and apply the $\min$ operation to construct the whole scene. Notably, the foreground is constrained with an object distinct loss (Eq. \ref{eq: obj_distinct}), and the background is regularized with a Manhattan loss (Eq. \ref{eq: manhattan}) and a floor reflection loss (Eq. \ref{eq: floor}). Furthermore, we train a dedicated feature network to render the generalized features. (2) Interactive decomposition. We first extract the SAM features from the feature network onto the vertices of the reconstructed mesh. Then, for any given pose, we can render a color image and a feature image. Passing the feature image and a user-selected prompt to the SAM decoder yields the 2D mask of the regions of interest. With our newly proposed surface region-growing algorithm, we then obtain the 3D mesh corresponding to these regions. Our method lets the user select objects at varying levels of granularity with just one or two clicks.
  • ...and 7 more figures
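
The Figure 5 caption mentions applying the $\min$ operation to combine the foreground and background representations into the whole scene. A minimal sketch of that composition, assuming both parts are signed distance functions (SDFs), is shown below. Analytic shapes stand in for the learned networks here: `sphere_sdf`, `floor_sdf`, and `compose_scene_sdf` are hypothetical names chosen for illustration, not functions from the Total-Decom code.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance to a sphere (stand-in for a foreground object SDF)."""
    return np.linalg.norm(points - center, axis=-1) - radius


def floor_sdf(points, height=0.0):
    """Signed distance to a horizontal plane z = height (stand-in for background)."""
    return points[..., 2] - height


def compose_scene_sdf(foreground_sdf, background_sdf):
    """Scene SDF as the pointwise minimum of the two component SDFs."""
    return np.minimum(foreground_sdf, background_sdf)


points = np.array([
    [0.0, 0.0, 1.0],   # at the sphere center
    [0.0, 0.0, 0.1],   # just above the floor, below the sphere
    [2.0, 0.0, 0.5],   # far from the sphere, above the floor
])
fg = sphere_sdf(points, center=np.array([0.0, 0.0, 1.0]), radius=0.5)
bg = floor_sdf(points)
scene = compose_scene_sdf(fg, bg)
print(scene)  # approximately -0.5, 0.1, 0.5
```

Taking the minimum means each query point is assigned the distance to whichever surface is closest, so the zero level set of the composed field is the union of the foreground and background surfaces while each component can still be queried or meshed on its own.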