DORec: Decomposed Object Reconstruction and Segmentation Utilizing 2D Self-Supervised Features

Jun Wu, Sicheng Li, Sihui Ji, Yifei Yang, Yue Wang, Rong Xiong, Yiyi Liao

TL;DR

DORec addresses the challenge of decomposed object reconstruction in cluttered scenes without manual 2D labels by introducing two self-supervised masks: a coarse binary foreground mask and a median-grained K-cluster mask. It employs a compositional neural implicit network with a foreground surface model and a background radiance field, using max fusion to separate foreground geometry and background appearance. The approach is validated across diverse real-world datasets, showing improved foreground segmentation and object-level reconstruction, with ablations confirming the value of dual-mask supervision and the two-stage training strategy. The work demonstrates potential for downstream robotics tasks such as pose estimation, though it notes long training times and points to efficiency improvements as future work.

Abstract

Recovering 3D geometry and textures of individual objects is crucial for many robotics applications, such as manipulation, pose estimation, and autonomous driving. However, decomposing a target object from a complex background is challenging. Most existing approaches rely on costly manual labels to acquire object instance perception. Recent advancements in 2D self-supervised learning offer new prospects for identifying objects of interest, yet leveraging such noisy 2D features for clean decomposition remains difficult. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to use 2D self-supervised features to create two levels of masks for supervision: a binary mask for foreground regions and a K-cluster mask for semantically similar regions. These complementary masks result in robust decomposition. Experimental results on different datasets show DORec's superiority in segmenting and reconstructing diverse foreground objects from varied backgrounds, enabling downstream tasks such as pose estimation.
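
To make the idea of a K-cluster mask concrete, the sketch below groups per-pixel feature vectors into K clusters so that pixels with similar features share a label. It is only an illustration under stated assumptions: plain k-means is used as a stand-in (this section does not say which clustering procedure the paper uses), the function names `squared_distance` and `kmeans_labels` are hypothetical, and the feature values are made up rather than taken from a self-supervised network.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(features: List[Sequence[float]], k: int, iters: int = 10) -> List[int]:
    """Assign each feature vector to one of k clusters with plain k-means.

    Assumption: k-means stands in for whatever clustering the paper uses.
    """
    # Deterministic initialisation: the first k feature vectors are the centroids.
    centroids = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each pixel takes the label of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(f, centroids[j]))
            for f in features
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up 2-D features for a 2x3 image, flattened row by row.
features = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 5.0), (0.0, 0.0), (5.0, 5.0)]
labels = kmeans_labels(features, k=2)

# Reshape the flat labels into a 2x3 K-cluster mask.
width = 3
cluster_mask = [labels[i:i + width] for i in range(0, len(labels), width)]
```

In this toy input the features fall into two well-separated groups, so the resulting mask alternates between the two labels. Real self-supervised features are noisier, which is the difficulty the abstract points to.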

Paper Structure (23 sections, 10 equations, 17 figures, 10 tables)

This paper contains 23 sections, 10 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Decomposed segmentation and reconstruction. By leveraging self-supervised 2D features in the form of coarse- and median-grained masks, our proposed method DORec achieves decomposed object reconstruction without 2D annotations and enables downstream tasks.
  • Figure 2: Method overview. The left part in blue shows our decomposed network consisting of a background model $f_\theta$ and a foreground model $f_\phi$. The background predictions $(\mathbf{c}_b,\sigma_b)$ and foreground predictions $(\mathbf{c}_f, \rho_f)$ are blended together via point-wise max composition. The right part in pink illustrates how we leverage self-supervised 2D features of varying granularity to enable decomposed reconstruction.
  • Figure 3: Point-wise max composition. Unlike NeuS (wang2021neus), we model points inside the unit sphere using both the foreground and background models. The values are combined via max composition, meaning the lower density is ignored.
  • Figure 4: Median-grained and coarse-grained masks obtained from self-supervised networks.
  • Figure 5: Coarse- and median-grained features of the same scene in two different views.
  • ...and 12 more figures
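
The point-wise max composition described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the foreground value $\rho_f$ and the background density $\sigma_b$ are directly comparable, that the colour is taken from whichever model wins at each point, and that ties go to the foreground. The function name `max_compose` and all sample values are hypothetical.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]


def max_compose(
    fg_density: List[float],
    fg_color: List[Color],
    bg_density: List[float],
    bg_color: List[Color],
) -> Tuple[List[float], List[Color], List[bool]]:
    """Blend foreground and background predictions point by point.

    At each sample point the model with the higher density wins and the
    lower density (with its colour) is ignored. Ties go to the foreground,
    an arbitrary choice made for this sketch.
    """
    density, color, from_fg = [], [], []
    for rho_f, c_f, sigma_b, c_b in zip(fg_density, fg_color, bg_density, bg_color):
        use_fg = rho_f >= sigma_b
        density.append(rho_f if use_fg else sigma_b)
        color.append(c_f if use_fg else c_b)
        from_fg.append(use_fg)
    return density, color, from_fg


# Three sample points along one ray (made-up values):
# red foreground colour, blue background colour.
density, color, from_fg = max_compose(
    fg_density=[0.1, 2.0, 0.0],
    fg_color=[(1.0, 0.0, 0.0)] * 3,
    bg_density=[0.5, 0.3, 0.0],
    bg_color=[(0.0, 0.0, 1.0)] * 3,
)
```

The `from_fg` flags record which model explains each point, which is the kind of per-point assignment a decomposed representation needs in order to keep foreground and background apart.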