SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model

Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, Xiang Bai

TL;DR

This work investigates zero-shot 3D object detection by leveraging the Segment Anything Model (SAM) in a BEV-based pipeline. By projecting LiDAR points into BEV, applying SAM for segmentation, and converting masks into 3D boxes via Mask2Box with LiDAR-based vertical refinement, the approach demonstrates the feasibility of extending vision foundation models to 3D perception. The study provides extensive ablations on BEV representations, prompt strategies, and post-processing, and compares against fully supervised detectors, highlighting both the promise and the current gaps of zero-shot 3D detection. The findings suggest a viable path toward integrating foundation models into 3D tasks, with possible future enhancements in multi-class handling, few-shot learning, and cross-modal prompting.

Abstract

With the development of large language models, many remarkable linguistic systems like ChatGPT have thrived and achieved astonishing success on many tasks, showing the incredible power of foundation models. In the spirit of unleashing the capability of foundation models on vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, has been proposed recently and presents strong zero-shot ability on many downstream 2D tasks. However, whether SAM can be adapted to 3D vision tasks has yet to be explored, especially 3D object detection. With this inspiration, we explore adapting the zero-shot ability of SAM to 3D object detection in this paper. We propose a SAM-powered BEV processing pipeline to detect objects and get promising results on the large-scale Waymo open dataset. As an early attempt, our method takes a step toward 3D object detection with vision foundation models and presents the opportunity to unleash their power on 3D vision tasks. The code is released at https://github.com/DYZhang09/SAM3D.
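
To make the final step of the pipeline concrete, the following is a minimal sketch of how one BEV mask could be lifted to a 3D box, with the vertical extent taken from the LiDAR points that fall under the mask. It is not the released implementation: the function name `mask_to_box`, its parameters, and the axis-aligned box (no heading angle) are simplifying assumptions made for illustration, and the mask is assumed to share the grid origin and cell size of the BEV image.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres
Box3D = Tuple[float, float, float, float, float, float]  # (cx, cy, cz, dx, dy, dz)


def mask_to_box(
    mask: Sequence[Sequence[bool]],
    points: Sequence[Point],
    pillar_size: float,
    x_min: float,
    y_min: float,
) -> Optional[Box3D]:
    """Hypothetical helper: one BEV mask -> one axis-aligned 3D box."""
    # Collect the (row, col) index of every BEV cell covered by the mask.
    cells = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]

    # Horizontal extent: map the cell index range back to metric coordinates.
    x0 = x_min + min(cols) * pillar_size
    x1 = x_min + (max(cols) + 1) * pillar_size
    y0 = y_min + min(rows) * pillar_size
    y1 = y_min + (max(rows) + 1) * pillar_size

    # Vertical extent: use the heights of LiDAR points inside masked cells.
    covered = set(cells)
    zs = []
    for x, y, z in points:
        row = int((y - y_min) // pillar_size)
        col = int((x - x_min) // pillar_size)
        if (row, col) in covered:
            zs.append(z)
    if not zs:
        return None  # a mask with no supporting points gives no box
    z0, z1 = min(zs), max(zs)

    return ((x0 + x1) / 2, (y0 + y1) / 2, (z0 + z1) / 2, x1 - x0, y1 - y0, z1 - z0)
```

A rotated box would need an extra fitting step on the mask outline, which this sketch leaves out on purpose.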

Paper Structure (26 sections, 15 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: (a) The overall framework of our method. We first project LiDAR points to colorful BEV images via a predefined palette, then post-process BEV images to better fit the requirements of SAM. After the segmentation, we post-process the noisy masks and finally predict 3D bounding boxes with the aid of LiDAR points. (b) The results of SAM3D using different versions of SAM. (c) The results of SAM3D using different pillar sizes. We report metrics of VEHICLE in the range [0,30) on Waymo validation set.
  • Figure 2: The visualizations of results from SAM3D. Each sub-figure corresponds to a single frame. The left side of each sub-figure is the visualization of 2D bounding boxes under the Bird's Eye View (BEV), and the right is the visualization of 3D bounding boxes.
  • Figure 3: The visualization of BEV images under different pillar size settings.
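
The projection step in Figure 1(a) and the pillar-size setting in Figure 3 can be illustrated with a small rasterization sketch. The captions only state that points are colored via a predefined palette, so the height-based blue-to-red ramp below is an assumption, as are the names `height_to_color` and `points_to_bev` and the choice to keep the highest point per cell.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres
RGB = Tuple[int, int, int]


def height_to_color(z: float, z_min: float, z_max: float) -> RGB:
    """Assumed palette: linear blue-to-red ramp over the height range."""
    t = (z - z_min) / (z_max - z_min)
    t = min(max(t, 0.0), 1.0)  # clamp heights outside the range
    return (int(255 * t), 0, int(255 * (1.0 - t)))


def points_to_bev(
    points: Sequence[Point],
    pillar_size: float,
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    z_range: Tuple[float, float],
) -> List[List[RGB]]:
    """Rasterize LiDAR points into a BEV image with square cells (pillars)."""
    width = int(math.ceil((x_range[1] - x_range[0]) / pillar_size))
    height = int(math.ceil((y_range[1] - y_range[0]) / pillar_size))
    image: List[List[RGB]] = [[(0, 0, 0)] * width for _ in range(height)]

    # Keep the highest point per cell so each pillar gets a single color.
    top: Dict[Tuple[int, int], float] = {}
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # drop points outside the BEV window
        col = min(int((x - x_range[0]) // pillar_size), width - 1)
        row = min(int((y - y_range[0]) // pillar_size), height - 1)
        if z > top.get((row, col), -math.inf):
            top[(row, col)] = z

    for (row, col), z in top.items():
        image[row][col] = height_to_color(z, z_range[0], z_range[1])
    return image
```

Halving `pillar_size` doubles the image resolution along each axis while the same points occupy proportionally fewer cells, which is the trade-off between detail and sparsity that the pillar-size comparison concerns.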