SAS: Segment Any 3D Scene with Integrated 2D Priors
Zhuoyuan Li, Jiahao Lu, Jiacheng Deng, Hanzhi Chang, Lifan Wu, Yanzhe Liang, Tianzhu Zhang
TL;DR
SAS tackles open-vocabulary 3D scene understanding by integrating multiple 2D foundation models through a four-stage pipeline. Model Alignment via Text aligns the embeddings of different 2D models in a shared space using captions. Annotation-free Model Capability Construction quantifies each model's recognition capability per category using diffusion-based synthetic data. Feature Fusion merges the aligned 2D features under capability guidance. Distillation transfers the fused 2D knowledge to a 3D encoder. The approach yields strong zero-shot and long-tail performance across indoor and outdoor datasets (ScanNet v2, Matterport3D, nuScenes) and extends to 3D Gaussian segmentation and instance segmentation. Key innovations include explicit modeling of 2D model capabilities via diffusion-generated synthetic imagery, a capability-guided fusion strategy, and a dual distillation regime that combines superpoint distillation with temporal-ensembling self-distillation. Together, these contributions offer a practical way to leverage multiple 2D priors for 3D open-vocabulary perception.
Abstract
The open-vocabulary capability of 3D models is increasingly valued, as traditional models trained on a fixed set of categories fail to recognize unseen objects in complex, dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open-vocabulary capability of multiple 2D models and transfer it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space, using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify each 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open-vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
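To make the fusion and distillation steps concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that per-point features from K 2D models have already been aligned into a shared embedding space, that a capability matrix with one score per (model, category) pair is available, and that distillation uses a cosine loss. The function names, the per-point weighting rule (each model is weighted by its capability score for the category it predicts), and the loss form are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def capability_guided_fusion(point_feats, text_embeds, capability, eps=1e-8):
    """Fuse per-point features from several 2D models (hypothetical rule).

    point_feats: (K, N, D) features of N points from K models, already
                 aligned to a shared embedding space (assumption).
    text_embeds: (C, D) text embeddings of the C category names.
    capability:  (K, C) non-negative score of model k on category c
                 (assumed to come from the capability-construction stage).
    Returns an (N, D) array of unit-norm fused features.
    """
    feats = l2_normalize(point_feats)
    texts = l2_normalize(text_embeds)

    # Cosine similarity of every point to every category, per model: (K, N, C).
    sims = feats @ texts.T
    # Category each model predicts for each point: (K, N).
    pred = sims.argmax(axis=-1)
    # Capability of each model on its own predicted category: (K, N).
    weights = np.take_along_axis(capability, pred, axis=1)
    # Normalize the weights across models so they sum to (about) one per point.
    weights = weights / (weights.sum(axis=0, keepdims=True) + eps)

    # Capability-weighted sum over models, then renormalize.
    fused = (weights[..., None] * feats).sum(axis=0)
    return l2_normalize(fused)


def cosine_distillation_loss(feats_3d, fused_2d):
    """Mean (1 - cosine similarity) between 3D and fused 2D features.

    The cosine form is an assumption; the abstract only says
    'feature distillation'.
    """
    a = l2_normalize(feats_3d)
    b = l2_normalize(fused_2d)
    return float(1.0 - (a * b).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, N, D, C = 2, 5, 8, 3                  # toy sizes, chosen arbitrarily
    point_feats = rng.normal(size=(K, N, D))
    text_embeds = rng.normal(size=(C, D))
    capability = rng.uniform(size=(K, C))    # stand-in capability scores

    fused = capability_guided_fusion(point_feats, text_embeds, capability)
    feats_3d = rng.normal(size=(N, D))       # stand-in for 3D encoder output
    print(fused.shape, cosine_distillation_loss(feats_3d, fused))
```

In this sketch, a model with a low capability score for the category it predicts at a given point contributes little to that point's fused feature, which is one simple way to read "capability-guided" fusion. A real pipeline would replace the random arrays with projected 2D features and the output of a trainable 3D encoder.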
