SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, Xiang Bai
TL;DR
This work investigates zero-shot 3D object detection by applying the Segment Anything Model (SAM) in a bird's-eye-view (BEV) pipeline. LiDAR points are projected into a BEV image, SAM segments that image, and Mask2Box converts the resulting masks into 3D boxes, with the vertical extent refined from the LiDAR points. The approach demonstrates the feasibility of extending vision foundation models to 3D perception. The study provides extensive ablations on BEV representations, prompt strategies, and post-processing, and compares against fully supervised detectors, highlighting both the promise and the current gaps of zero-shot 3D detection. The findings suggest a viable path toward integrating foundation models into 3D tasks, with future work on multi-class handling, few-shot learning, and cross-modal prompting.
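To make the first stage concrete, here is a minimal sketch of projecting LiDAR points into a BEV image that a 2D segmentation model could consume. It is an illustration under stated assumptions, not the authors' implementation: the function name `lidar_to_bev`, the range and cell-size defaults, and the choice to encode the maximum height per cell as a gray level are all hypothetical.

```python
import numpy as np


def lidar_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.1):
    """Rasterize LiDAR points of shape (N, 3+) into a BEV image.

    Hypothetical helper: rows index x (forward), columns index y (lateral).
    Returns an (H, W, 3) uint8 image and an (H, W) boolean occupancy map.
    """
    points = np.asarray(points, dtype=np.float64)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only points inside the chosen BEV window.
    keep = (
        (x >= x_range[0]) & (x < x_range[1])
        & (y >= y_range[0]) & (y < y_range[1])
    )
    x, y, z = x[keep], y[keep], z[keep]

    height = int(round((x_range[1] - x_range[0]) / cell))
    width = int(round((y_range[1] - y_range[0]) / cell))

    # Metric coordinates -> integer cell indices (clipped against float rounding).
    rows = np.clip(np.floor((x - x_range[0]) / cell).astype(np.int64), 0, height - 1)
    cols = np.clip(np.floor((y - y_range[0]) / cell).astype(np.int64), 0, width - 1)

    # Highest point per cell; empty cells stay at -inf.
    top = np.full((height, width), -np.inf)
    np.maximum.at(top, (rows, cols), z)
    occupied = np.isfinite(top)

    # Assumed encoding: map height to a gray level in [55, 255]; empty cells are black.
    image = np.zeros((height, width, 3), dtype=np.uint8)
    if occupied.any():
        lo, hi = top[occupied].min(), top[occupied].max()
        scale = (top[occupied] - lo) / max(hi - lo, 1e-6)
        image[occupied] = (55 + 200 * scale).astype(np.uint8)[:, None]
    return image, occupied
```

The resulting `image` is an ordinary 3-channel array, so in principle it can be passed to any 2D segmenter. Which per-cell attribute and color mapping work best is exactly the kind of BEV representation choice the ablations examine.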
Abstract
With the development of large language models, many remarkable language systems such as ChatGPT have thrived and achieved astonishing success on many tasks, showing the power of foundation models. In the spirit of unleashing the capability of foundation models on vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, has recently been proposed and shows strong zero-shot ability on many downstream 2D tasks. However, whether SAM can be adapted to 3D vision tasks, especially 3D object detection, has yet to be explored. Motivated by this question, we explore adapting the zero-shot ability of SAM to 3D object detection in this paper. We propose a SAM-powered bird's-eye-view (BEV) processing pipeline to detect objects and obtain promising results on the large-scale Waymo Open Dataset. As an early attempt, our method takes a step toward 3D object detection with vision foundation models and presents the opportunity to unleash their power on 3D vision tasks. The code is released at https://github.com/DYZhang09/SAM3D.
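As a rough illustration of the last stage of such a BEV pipeline, the sketch below turns one binary BEV mask into a 3D box and takes the vertical extent from the LiDAR points inside the footprint. It is a simplified stand-in, not the released Mask2Box code: the function name `mask_to_box`, its parameters, and the use of an axis-aligned footprint (with no heading estimate) are assumptions made for brevity.

```python
import numpy as np


def mask_to_box(mask, points, x_min=0.0, y_min=-20.0, cell=0.1):
    """Convert a binary BEV mask into an axis-aligned 3D box.

    Hypothetical helper: rows of `mask` index x, columns index y, and
    (x_min, y_min, cell) must match the grid used to build the BEV image.
    Returns np.array([cx, cy, cz, dx, dy, dz]) or None if no box can be formed.
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None

    # Footprint in metric coordinates from the extreme mask cells.
    x0 = x_min + rows.min() * cell
    x1 = x_min + (rows.max() + 1) * cell
    y0 = y_min + cols.min() * cell
    y1 = y_min + (cols.max() + 1) * cell

    # Vertical refinement: use the z-range of LiDAR points inside the footprint.
    points = np.asarray(points, dtype=np.float64)
    inside = (
        (points[:, 0] >= x0) & (points[:, 0] < x1)
        & (points[:, 1] >= y0) & (points[:, 1] < y1)
    )
    if not inside.any():
        return None
    z = points[inside, 2]
    z0, z1 = z.min(), z.max()

    return np.array([
        (x0 + x1) / 2, (y0 + y1) / 2, (z0 + z1) / 2,  # box center
        x1 - x0, y1 - y0, z1 - z0,                    # box size
    ])
```

A fuller version would likely fit a rotated rectangle to the mask to recover the heading and filter out implausibly small or large boxes; those steps are omitted here.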
