NTO3D: Neural Target Object 3D Reconstruction with Segment Anything

Xiaobao Wei, Renrui Zhang, Jiarui Wu, Jiaming Liu, Ming Lu, Yandong Guo, Shanghang Zhang

TL;DR

NTO3D addresses the problem of reconstructing a single user-specified object inside a scene with neural implicit representations. It introduces a 3D occupancy field that lifts multi-view 2D SAM masks into 3D space and an additional 3D SAM feature field that distills SAM encoder features into the voxel grid, enabling high-quality target-object 3D reconstruction. The method iteratively refines segmentation and then optimizes the neural field with a joint loss that includes color, geometry, and feature terms. Experiments on DTU, LLFF, and BlendedMVS demonstrate significant improvements in segmentation accuracy, rendering quality, and geometry accuracy over state-of-the-art baselines, highlighting the practical impact of combining foundation models with neural fields.

Abstract

Neural 3D reconstruction from multi-view images has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene, while how to reconstruct a target object indicated by users remains under-explored. Considering that the Segment Anything Model (SAM) has shown effectiveness in segmenting arbitrary 2D images, we propose NTO3D, a novel high-quality Neural Target Object 3D reconstruction method that leverages the benefits of both neural fields and SAM. We first propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy field is then projected into 2D space to generate new prompts for SAM. This process iterates until convergence to separate the target object from the scene. We then lift the 2D features of the SAM encoder into a 3D feature field to improve the reconstruction quality of the target object. In short, NTO3D lifts the 2D masks and features of SAM into the 3D neural field for high-quality neural target object 3D reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be available at: https://github.com/ucwxb/NTO3D.
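
The iterative mask lifting described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the callables `fit_occupancy`, `render_mask`, and `segment` are hypothetical stand-ins for training the occupancy field, projecting it into a view, and prompting SAM, and the IoU-based stopping rule with its threshold is an assumption, since the abstract only says the process runs "until convergence". Masks are represented as sets of (row, col) pixels and `views` is assumed to be non-empty.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty masks agree trivially
    return len(mask_a & mask_b) / len(union)


def iterate_mask_lifting(views, fit_occupancy, render_mask, segment,
                         initial_masks, iou_threshold=0.95, max_rounds=10):
    """Alternate between lifting 2D masks to 3D and re-prompting the segmenter.

    All three callables are hypothetical placeholders:
      fit_occupancy(masks_by_view) -> field
      render_mask(field, view)     -> mask rendered from the field in that view
      segment(view, rendered_mask) -> mask returned by the 2D segmenter
    """
    sam_masks = dict(initial_masks)  # may cover only the user-prompted view
    for round_index in range(1, max_rounds + 1):
        # Lift the current 2D masks into a single 3D occupancy field.
        field = fit_occupancy(sam_masks)
        # Project the field back into every view.
        rendered = {view: render_mask(field, view) for view in views}
        # Use each projection to derive new prompts and segment again.
        sam_masks = {view: segment(view, rendered[view]) for view in views}
        # Assumed convergence test: rendered and segmented masks agree.
        mean_iou = sum(mask_iou(rendered[v], sam_masks[v]) for v in views) / len(views)
        if mean_iou >= iou_threshold:
            break
    return sam_masks, round_index
```

The loop returns the final per-view masks together with the number of rounds used, which makes it easy to check whether the cap `max_rounds` was reached before agreement.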

Paper Structure (14 sections, 8 equations, 6 figures, 6 tables)

This paper contains 14 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of NTO3D. First, a user selects a reconstruction target in the scene. Then, NTO3D iteratively uses a 3D occupancy field to merge the multi-view 2D segmentation masks into 3D space. NTO3D further lifts the features of the SAM encoder into a 3D SAM feature field and optimizes the feature field together with the other fields. Finally, the user obtains a high-quality 3D reconstruction model of the target object.
  • Figure 2: The overall pipeline of NTO3D. First, the user specifies the target object to be reconstructed and sends prompts to SAM for segmentation on the initial view. With multi-view images as input, we train the 3D occupancy field iteratively to lift cross-view masks into 3D space. Once the 3D occupancy field converges to high-quality masks of the target object, we finetune the pre-trained neural field on the masked images and distill SAM encoder features into 3D space to obtain better reconstruction quality.
  • Figure 3: Illustration of the 3D occupancy field. Multiple rays interact implicitly to decide whether each point is foreground or background. For a background ray, all points on it belong to the background. For a foreground ray, at least one point on it is foreground.
  • Figure 4: Illustration of iterative mask lifting, in which $M_{o}$ denotes masks generated by the 3D occupancy field and $M_{SAM}$ denotes masks provided by SAM based on prompts. Given a user's prompts for a specific object, the 3D occupancy field initially renders a coarse mask in another view, which leads to bad prompts for SAM and defective masks. However, because the 3D occupancy field lifts 2D masks from all views into 3D space, it efficiently corrects its false judgments of voxels in other views. With iterative training, $M_{o}$ and $M_{SAM}$ shrink and finally converge to the same mask.
  • Figure 5: Qualitative comparison on DTU. Best viewed in colors.
  • ...and 1 more figures
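
The ray constraint in the Figure 3 caption (every point on a background ray is background; at least one point on a foreground ray is foreground) can be expressed as a per-ray loss. The sketch below is one way to encode it, under stated assumptions: aggregating per-point occupancies with a maximum and comparing against the 2D mask label with binary cross-entropy are illustrative choices, not confirmed details of NTO3D, and the function names are hypothetical.

```python
import math


def ray_occupancy(point_occupancies):
    """Aggregate per-point occupancies in [0, 1] along one ray by their maximum.

    Assumed aggregation: a ray counts as foreground if any sampled point does.
    """
    return max(point_occupancies)


def ray_mask_loss(point_occupancies, is_foreground, eps=1e-7):
    """Binary cross-entropy between the ray-level occupancy and the mask label.

    Background label: driving the maximum to 0 forces every point to 0.
    Foreground label: driving the maximum to 1 needs only one occupied point.
    """
    occupancy = min(max(ray_occupancy(point_occupancies), eps), 1.0 - eps)
    target = 1.0 if is_foreground else 0.0
    return -(target * math.log(occupancy)
             + (1.0 - target) * math.log(1.0 - occupancy))
```

With this formulation, a ray labeled background is penalized as soon as a single sampled point has high occupancy, while a ray labeled foreground is satisfied by one occupied point regardless of the others, which matches the asymmetry stated in the caption.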