DORec: Decomposed Object Reconstruction and Segmentation Utilizing 2D Self-Supervised Features
Jun Wu, Sicheng Li, Sihui Ji, Yifei Yang, Yue Wang, Rong Xiong, Yiyi Liao
TL;DR
DORec addresses the challenge of decomposed object reconstruction in cluttered scenes without manual 2D labels by introducing two self-supervised masks: a coarse binary foreground mask and a medium-grained K-cluster mask. It employs a compositional neural implicit network with a foreground surface model and a background radiance field, using max fusion to separate foreground geometry and background appearance. The approach is validated across diverse real-world datasets, showing improved foreground segmentation and object-level reconstruction, with ablations confirming the value of dual-mask supervision and the two-stage training strategy. The work demonstrates potential for downstream robotics tasks such as pose estimation, though it notes long training times and points to efficiency improvements as future work.
Abstract
Recovering the 3D geometry and textures of individual objects is crucial for many robotics applications, such as manipulation, pose estimation, and autonomous driving. However, decomposing a target object from a complex background is challenging. Most existing approaches rely on costly manual labels to identify object instances. Recent advances in 2D self-supervised learning offer new prospects for identifying objects of interest, yet leveraging such noisy 2D features for clean decomposition remains difficult. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to use 2D self-supervised features to create two levels of masks for supervision: a binary mask for foreground regions and a K-cluster mask for semantically similar regions. Together, these complementary masks yield robust decomposition. Experimental results on different datasets show DORec's superiority in segmenting and reconstructing diverse foreground objects from varied backgrounds, enabling downstream tasks such as pose estimation.
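As a rough illustration of the two mask levels described above, the sketch below clusters per-pixel feature vectors into K groups to form a K-cluster mask, then marks a chosen subset of clusters as foreground to form a binary mask. It is not the authors' pipeline: the use of plain k-means, the farthest-point initialization, the function names `kmeans_labels`, `binary_mask_from_clusters`, and `to_grid`, and the idea of passing foreground cluster ids by hand are all assumptions made for illustration. In practice the features would come from a 2D self-supervised model rather than the toy vectors used here.

```python
def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(features, k, iters=20):
    """Assign each feature vector to one of k clusters with Lloyd's k-means.

    Centers are initialized deterministically: the first feature, then
    repeatedly the feature farthest from all centers chosen so far.
    (Assumed for this sketch; any reasonable initialization would do.)
    """
    centers = [list(features[0])]
    while len(centers) < k:
        far = max(features, key=lambda f: min(_sqdist(f, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each feature goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: _sqdist(f, centers[c])) for f in features
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                dim = len(members[0])
                centers[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return labels


def binary_mask_from_clusters(labels, foreground_clusters):
    """Collapse a K-cluster labeling into a 0/1 foreground mask.

    `foreground_clusters` is a hypothetical input: a set of cluster ids
    treated as foreground. How those ids are selected is not covered here.
    """
    return [1 if lab in foreground_clusters else 0 for lab in labels]


def to_grid(flat, width):
    """Reshape a flat per-pixel list into rows of the given width."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Toy 2x2 "image": two pixels with one kind of feature, two with another.
toy_features = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
cluster_mask = kmeans_labels(toy_features, k=2)
foreground = binary_mask_from_clusters(cluster_mask, {cluster_mask[2]})
cluster_grid = to_grid(cluster_mask, width=2)
foreground_grid = to_grid(foreground, width=2)
```

The binary mask here is simply a coarser view of the cluster labeling, which is one way to read "two levels" of supervision; the actual construction of each mask in DORec may differ.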
