SAS: Segment Any 3D Scene with Integrated 2D Priors

Zhuoyuan Li, Jiahao Lu, Jiacheng Deng, Hanzhi Chang, Lifan Wu, Yanzhe Liang, Tianzhu Zhang

TL;DR

SAS tackles open vocabulary 3D scene understanding by integrating multiple 2D foundation models through a four-stage pipeline: Model Alignment via Text aligns 2D model embeddings in a shared space using captions, Annotation-free Model Capability Construction quantifies each model's recognition capabilities with diffusion-based synthetic data, Feature Fusion merges aligned 2D features under capability guidance, and Distillation transfers the fused 2D knowledge to a 3D encoder. The approach yields strong zero-shot and long-tail performance across indoor and outdoor datasets (ScanNet v2, Matterport3D, nuScenes) and extends to 3D Gaussian and instance segmentation, demonstrating broad generalization. Key innovations include explicit modeling of 2D model capabilities via diffusion-driven synthetic imagery and a capability-guided fusion strategy, as well as a dual distillation regime combining superpoint and temporal ensembling self-distillation. These contributions offer a practical and scalable path to leveraging multiple 2D priors for robust 3D open vocabulary perception with significant performance gains over prior work.

Abstract

The open vocabulary capability of 3D models is increasingly valued, because traditional models trained on a fixed set of categories fail to recognize unseen objects in complex, dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify each 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from the different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, and its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
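To make the fusion step more concrete, the following is a minimal sketch of one way capability-guided fusion could work. It is not the authors' implementation: the function name `capability_guided_fusion`, the array shapes, and the rule of weighting each model by its capability score for the category it predicts at a point are all assumptions made for illustration. It assumes the per-point features of all 2D models already live in the shared text embedding space.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def capability_guided_fusion(point_feats, text_embeds, capability):
    """Fuse per-point features from K models (hypothetical sketch).

    point_feats: (K, N, D) features of N points from K aligned 2D models
    text_embeds: (C, D) text embeddings of C category names
    capability:  (K, C) score of model k on category c (assumed given)
    returns:     (N, D) fused, unit-length point features
    """
    feats = l2_normalize(point_feats)
    texts = l2_normalize(text_embeds)

    # Cosine similarity of every point feature to every category: (K, N, C)
    sims = feats @ texts.T
    # Category each model predicts at each point: (K, N)
    preds = sims.argmax(axis=-1)

    # Assumption: a model's weight at a point is its capability score
    # for the category it predicts there.
    weights = np.take_along_axis(capability, preds, axis=1)  # (K, N)
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)

    # Capability-weighted average over models, then re-normalize.
    fused = (weights[..., None] * feats).sum(axis=0)
    return l2_normalize(fused)
```

Under this weighting, a model that predicts a category it is known to handle poorly contributes little to the fused feature at that point, which is the intuition behind letting capabilities guide the fusion.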

Paper Structure

This paper contains 43 sections, 13 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Left: Leading 2D open vocabulary models such as LSeg and SEEM often misidentify objects, which causes the distilled 3D model to make the same misidentification. Middle: Our proposed SAS successfully corrects the misidentified object. Right: SAS distills open vocabulary knowledge from multiple 2D models with novel designs, e.g., Annotation-free Model Capability Construction.
  • Figure 2: Overview of our proposed SAS. SAS first aligns the features of different models in a unified embedding space (Sec. 3.1). Then SAS constructs each model's capability to recognize various objects (Sec. 3.2). With the constructed capabilities as a guide, features from different 2D models are integrated (Sec. 3.3). Finally, a 3D network is distilled to enable 3D open vocabulary understanding (Sec. 3.4).
  • Figure 3: Overview of Model Alignment via Text. Features from different models are first aligned at the text level and then encoded by a shared text encoder to produce aligned features.
  • Figure 4: Overview of Annotation-free Model Capability Construction. Stable Diffusion is used to generate synthesized images, with masks computed by SAM. By assessing each model's performance on the synthesized images, we construct the model capabilities.
  • Figure 5: Visualization results. Semantic segmentation results of SAS on ScanNet v2.
  • ...and 4 more figures
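
The capability construction summarized in the Figure 4 caption can be illustrated with a small sketch. This is an illustration under stated assumptions, not the paper's procedure: the captions do not say which metric is used to assess a model on the synthesized images, so per-category mean IoU against the reference mask is assumed here, and the names `mask_iou` and `build_capability` are hypothetical.

```python
import numpy as np


def mask_iou(pred_mask, ref_mask):
    """Intersection over union of two boolean masks (0.0 if both are empty)."""
    inter = np.logical_and(pred_mask, ref_mask).sum()
    union = np.logical_or(pred_mask, ref_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def build_capability(pred_label_maps, ref_masks, categories, num_categories):
    """Per-category capability scores for one 2D model (hypothetical sketch).

    pred_label_maps: list of (H, W) int arrays, the model's predicted labels
                     on each synthesized image
    ref_masks:       list of (H, W) bool arrays, the object mask for each
                     image (assumed to come from a mask generator)
    categories:      list of int, the category each image was generated for
    returns:         (num_categories,) mean IoU per category, 0 where no
                     image exists for a category
    """
    sums = np.zeros(num_categories)
    counts = np.zeros(num_categories)
    for pred, ref, cat in zip(pred_label_maps, ref_masks, categories):
        # Score how well the model's pixels for `cat` cover the object mask.
        sums[cat] += mask_iou(pred == cat, ref)
        counts[cat] += 1
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
```

Running this once per 2D model and stacking the resulting vectors would give a models-by-categories table of scores, which is the kind of quantity a capability-guided fusion step could consume.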