SAM3D: Segment Anything in 3D Scenes

Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, Xihui Liu

TL;DR

SAM3D segments 3D scenes by reusing 2D SAM masks on 3D point clouds without any training. It projects the 2D masks of posed RGB frames onto the 3D points and iteratively merges the resulting partial masks across adjacent frames in a bidirectional, bottom-up fashion, optionally ensembling the result with a geometry-based over-segmentation. Qualitative results on ScanNet show reasonable, fine-grained 3D masks without finetuning SAM, which suggests that SAM-based priors, combined with a robust multi-view fusion strategy, can serve as a training-free baseline for 3D segmentation.

Abstract

In this work, we propose SAM3D, a novel framework that predicts masks in 3D point clouds by leveraging the Segment-Anything Model (SAM) on RGB images without further training or finetuning. For a point cloud of a 3D scene with posed RGB images, we first predict segmentation masks of the RGB images with SAM, and then project the 2D masks onto the 3D points. We then merge the 3D masks iteratively with a bottom-up merging approach. At each step, we merge the point cloud masks of two adjacent frames with the bidirectional merging approach. In this way, the 3D masks predicted from different frames are gradually merged into the 3D masks of the whole 3D scene. Finally, we can optionally ensemble the result of SAM3D with the over-segmentation results based on the geometric information of the 3D scene. We evaluate our approach on the ScanNet dataset, and qualitative results demonstrate that SAM3D achieves reasonable and fine-grained 3D segmentation results without any training or finetuning of SAM.
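
The projection step in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `project_mask_to_points`, the pinhole intrinsics `K`, the camera-to-world pose, the per-frame depth map used for an occlusion check, and the tolerance `depth_tol` are all assumptions made for illustration. The 2D mask ids are taken as given, for example from SAM's automatic mask generator.

```python
import numpy as np

def project_mask_to_points(points_world, mask_ids, depth, K, cam_to_world, depth_tol=0.05):
    """Give each 3D point the 2D mask id of the pixel it projects to (-1 if not visible).

    Hypothetical helper: assumes a pinhole camera with intrinsics K (3x3),
    a camera-to-world pose (4x4), and a depth map aligned with mask_ids.
    """
    n = points_world.shape[0]
    h, w = mask_ids.shape

    # Transform world-space points into the camera frame.
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_h = np.hstack([points_world, np.ones((n, 1))])
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]

    labels = np.full(n, -1, dtype=np.int64)
    in_front = z > 1e-6  # discard points behind the camera

    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam[in_front].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(np.int64)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(np.int64)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[inside]
    u, v = u[inside], v[inside]

    # Occlusion check (an assumption): keep points whose depth matches the depth map.
    visible = np.abs(depth[v, u] - z[idx]) < depth_tol
    labels[idx[visible]] = mask_ids[v[visible], u[visible]]
    return labels
```

Each frame then yields a partial labeling of the scene point cloud, with `-1` marking points the frame does not see; these partial labelings are what the merging stage consumes.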

Paper Structure

This paper contains 8 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: Qualitative results of our SAM3D. The first subfigure shows the 3D scene input. The second subfigure shows the segmentation masks predicted by SAM3D. The third subfigure shows the refined masks generated by ensembling the SAM3D result and the over-segmentation result. The last subfigure shows the ground-truth segmentation labels in ScanNet (Dai et al., 2017).
  • Figure 2: Overview. For input images, we first use SAM to generate 2D masks, and then map the 2D masks to 3D. Then we iteratively merge adjacent point clouds with the Bidirectional Merging (BM) approach until we obtain the 3D masks of the whole scene. We finally merge the SAM3D result with the over-segmentation masks to obtain an ensembled result.
  • Figure 3: Bottom-up Merging. We illustrate the process of gradually expanding the partial point cloud masks with the bottom-up merging approach.
  • Figure 4: Illustration of the Bidirectional Merging approach. We use different colors to denote different mask ids. Our bidirectional merging approach merges masks from different frames to obtain unified masks in the scene.
  • Figure 5: Qualitative Segmentation Results. Our approach generates high-quality instance masks at multiple scales. Colors indicate group indices only.
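
The merging stages named in the captions of Figures 2 to 4 can be sketched as follows. This is one plausible reading, not the authors' algorithm: it assumes every frame's labels are stored as an array over the same scene points (with `-1` for unlabeled points), that two masks are merged when one covers at least a fraction `thr` of the other's co-visible points, and that the optional ensemble is a majority vote inside each over-segment. The function names and the threshold are hypothetical.

```python
import numpy as np

def best_matches(src, dst, thr):
    """Map each src id to the dst id covering at least `thr` of its co-visible points."""
    both = (src >= 0) & (dst >= 0)
    matches = {}
    for s in np.unique(src[both]):
        ids, counts = np.unique(dst[both & (src == s)], return_counts=True)
        k = counts.argmax()
        if counts[k] / counts.sum() >= thr:
            matches[int(s)] = int(ids[k])
    return matches

def bidirectional_merge(a, b, thr=0.5):
    """Merge per-point mask ids of two frames, matching masks in both directions."""
    b = np.where(b >= 0, b + a.max() + 1, -1)  # keep the two id spaces disjoint
    parent = {int(i): int(i) for i in np.unique(np.concatenate([a, b])) if i >= 0}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Collect matches a -> b and b -> a, then union the matched ids.
    pairs = list(best_matches(a, b, thr).items()) + list(best_matches(b, a, thr).items())
    for s, d in pairs:
        rs, rd = find(s), find(d)
        parent[max(rs, rd)] = min(rs, rd)

    merged = np.where(a >= 0, a, b)
    return np.array([find(int(i)) if i >= 0 else -1 for i in merged], dtype=np.int64)

def bottom_up_merge(frames, thr=0.5):
    """Merge adjacent frames pairwise, level by level, until one labeling remains."""
    frames = list(frames)
    while len(frames) > 1:
        nxt = [bidirectional_merge(frames[i], frames[i + 1], thr)
               for i in range(0, len(frames) - 1, 2)]
        if len(frames) % 2:
            nxt.append(frames[-1])  # an odd frame is carried to the next level
        frames = nxt
    return frames[0]

def ensemble_with_oversegmentation(labels, segments):
    """Assumed ensemble rule: each over-segment takes its most frequent mask id."""
    out = labels.copy()
    for seg in np.unique(segments):
        member = segments == seg
        ids, counts = np.unique(labels[member & (labels >= 0)], return_counts=True)
        if ids.size:
            out[member] = ids[counts.argmax()]
    return out
```

Matching in both directions matters when one frame splits an object that the other frame sees as a single mask: the coarse mask finds no dominant partner, but each fine mask matches the coarse one, so the pieces are still unified.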