
ObjectCarver: Semi-automatic segmentation, reconstruction and separation of 3D objects

Gemmechu Hassena, Jonathan Moon, Ryan Fujii, Andrew Yuen, Noah Snavely, Steve Marschner, Bharath Hariharan

TL;DR

ObjectCarver tackles the challenge of decomposing a multi-object scene into separate, high-quality 3D surfaces without requiring per-view ground-truth masks. It first reconstructs a full scene as a single SDF, then propagates a user-provided 2D segmentation seed across views to obtain multi-view masks, and finally learns per-object SDFs with a novel triad of losses: compactness, overlap, and initialization-based stabilization. The method reduces floaters, handles occlusions, and completes occluded regions, outperforming baselines on a newly introduced real+synthetic dataset with complete object meshes. This approach enables accurate object-level manipulation and has strong practical potential for robotics, AR/VR, and scene editing.

Abstract

Implicit neural fields have made remarkable progress in reconstructing 3D surfaces from multiple images; however, they encounter challenges when it comes to separating individual objects within a scene. Previous work has attempted to tackle this problem by introducing a framework to train separate signed distance fields (SDFs) simultaneously for each of N objects and using a regularization term to prevent objects from overlapping. However, all of these methods require segmentation masks to be provided, which are not always readily available. We introduce our method, ObjectCarver, to tackle the problem of object separation from just click input in a single view. Given posed multi-view images and a set of user-input clicks to prompt segmentation of the individual objects, our method decomposes the scene into separate objects and reconstructs a high-quality 3D surface for each one. We introduce a loss function that prevents floaters and avoids inappropriate carving-out due to occlusion. In addition, we introduce a novel scene initialization method that significantly speeds up the process while preserving geometric details compared to previous approaches. Despite requiring neither ground truth masks nor monocular cues, our method outperforms baselines both qualitatively and quantitatively. In addition, we introduce a new benchmark dataset for evaluation.

Paper Structure (32 sections, 10 equations, 19 figures, 2 tables, 1 algorithm)

Figures (19)

  • Figure 1: Failure cases of SOTA. Running SAM independently on each image gives no correspondence between objects across views (Left). Even if one were to solve this correspondence problem, slight errors in SAM output mean that the same object may be segmented differently in different views (e.g., the top of the vase is included in the vase segment in the left image but not the right). Even with good segmentations, prior work such as ObjectSDF++ introduces floating artifacts, especially those hidden behind other objects (Right).
  • Figure 2: Mask Propagation pipeline: in the first iteration, a user clicks a point on each object to generate per-object anchor masks, which are then unprojected into 3D (here, we only show unprojected 3D points for the bottom can). These 3D points are subsequently projected back into each image view, while checking for occlusions. The projected points serve as seeds for SAM to generate masks for each object (bottom and top cans, door stop). To combine these individual segmentation masks into a single image, we use a depth ordering technique. In the next iterations, all views are used as anchor masks, allowing the pipeline to cover previously unseen regions.
  • Figure 3: Projection to 3D. Left: Example image. Middle: points projected without mask edge erosion and outlier removal, resulting in noisy segmentation outputs. Right: by using mask erosion and outlier removal we obtain clean 3D points and subsequently obtain a correct segmentation output.
  • Figure 4: An occlusion event. The object of interest is the blue cylinder. On the left is the segmentation mask. On the right, the crosses (not included in the segmentation mask) represent points on the blue object that are visible in other views but occluded in this view. The red dotted box is the amodal mask, and its intersection with the occluding cuboid is the set of pixels that are "present" in the blue object, but occluded in this view.
  • Figure 5: Left: Previous datasets, like Replica, feature objects that include only visible surfaces rather than complete surfaces (which would also cover hidden regions). As a result, using the cropped sub-meshes as ground truth for object separation is not an adequate evaluation. Middle and right: Our proposed dataset with complete individual objects.
  • ...and 14 more figures
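
The unproject-then-reproject step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `unproject` and `project_visible`, the pinhole intrinsics `K`, the 4x4 pose matrices, the z-depth maps, and the tolerance `tol` are all assumptions made for illustration. Mask erosion, outlier removal, the SAM call, and depth ordering are left out.

```python
import numpy as np

def unproject(mask, depth, K, cam_to_world):
    """Lift the masked pixels of an anchor view to world-space 3D points.

    Assumes `depth` stores z-depth along the optical axis (hypothetical input,
    e.g. rendered from the reconstructed scene).
    """
    v, u = np.nonzero(mask)                          # pixel rows and columns
    z = depth[v, u]
    pix = np.stack([u, v, np.ones_like(u)]).astype(float)   # 3 x N homogeneous pixels
    cam = (np.linalg.inv(K) @ pix) * z               # 3 x N camera-space points
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    return (cam_to_world @ cam_h)[:3].T              # N x 3 world-space points

def project_visible(points, depth, K, world_to_cam, tol=0.01):
    """Project world points into a target view and drop the occluded ones.

    A point counts as visible when its depth in the target camera is no
    farther than the target depth map at that pixel (plus a tolerance).
    Returns an M x 2 integer array of (u, v) seed pixels.
    """
    h, w = depth.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (world_to_cam @ pts_h.T)[:3]               # 3 x N camera-space points
    z = cam[2]
    front = z > 1e-6                                 # keep points in front of the camera
    safe_z = np.where(front, z, 1.0)                 # avoid division by zero
    uvw = K @ cam
    u = np.round(uvw[0] / safe_z).astype(int)
    v = np.round(uvw[1] / safe_z).astype(int)
    inside = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.nonzero(inside)[0]
    visible = z[idx] <= depth[v[idx], u[idx]] + tol  # occlusion check
    return np.stack([u[idx][visible], v[idx][visible]], axis=1)
```

In a pipeline like the one in the caption, the returned pixel coordinates would be passed as point prompts to a segmentation model for the target view. The occlusion check matters because a 3D point on one object can project onto a pixel that shows a nearer object, which would seed the wrong mask.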