SAGOnline: Segment Any Gaussians Online

Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu, Yiping Chen, Dedong Zhang, Lingfei Ma, John S. Zelek, Jonathan Li

TL;DR

SAGOnline tackles online, cross-view segmentation of 3D Gaussian Splatting scenes by decoupling the task into three lightweight sub-tasks: novel-view rendering, multi-view 2D video segmentation, and global 3D fusion. It introduces Rasterization-aware Geometric Consensus (RAGC) to deterministically lift 2D predictions into discrete 3D labels using visibility and geometric alignment constraints, yielding immediate, feature-free 3D masks. A Stage II self-supervised refinement with a lightweight student network densifies and stabilizes the masks while preserving geometry, reaching 27 ms per-frame inference after refinement. On NVOS and SPIn-NeRF, SAGOnline sets new state-of-the-art mIoU scores (92.7% and 95.2%, respectively) while running online, and it works across datasets and foundation models for instance, semantic, and prompt-based segmentation, making it well suited to interactive AR/VR and robotics applications.

Abstract

3D Gaussian Splatting has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Existing segmentation approaches typically rely on high-dimensional feature lifting, which leads to costly optimization, implicit semantics, and task-specific constraints. We present Segment Any Gaussians Online (SAGOnline), a unified, zero-shot framework that achieves real-time, cross-view consistent segmentation without scene-specific training. SAGOnline decouples the monolithic segmentation problem into lightweight sub-tasks. By integrating video foundation models (e.g., SAM 2), we first generate temporally consistent 2D masks across rendered views. Crucially, instead of learning continuous feature fields, we introduce a Rasterization-aware Geometric Consensus mechanism that leverages the traceability of the Gaussian rasterization pipeline. This allows us to deterministically map 2D predictions to explicit, discrete 3D primitive labels in real time. This discrete representation eliminates the memory and computational burden of feature distillation, enabling instant inference. Extensive evaluations on the NVOS and SPIn-NeRF benchmarks demonstrate that SAGOnline achieves state-of-the-art accuracy (92.7% and 95.2% mIoU) while operating at the fastest speed, 27 ms per frame. By providing a flexible interface for diverse foundation models, our framework supports instant prompt, instance, and semantic segmentation, paving the way for interactive 3D understanding in AR/VR and robotics.
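
The mapping from 2D predictions to discrete per-primitive labels can be illustrated with a minimal majority-voting sketch. It is not the paper's implementation: the function name, the `views` data layout, and the `min_votes` threshold are assumptions made for illustration, and the visibility and geometric-alignment filtering is assumed to have already produced the per-pixel lists of valid Gaussian ids.

```python
from collections import Counter, defaultdict


def lift_labels_by_majority_vote(views, min_votes=1):
    """Assign one discrete label per Gaussian by voting across views.

    views: iterable of (pixel_to_gaussians, pixel_labels) pairs, where
      pixel_to_gaussians maps (row, col) -> list of Gaussian ids judged valid
        for that pixel (hypothetical output of a traceable rasterizer), and
      pixel_labels maps (row, col) -> 2D segmentation label for that pixel.
    min_votes: hypothetical threshold; Gaussians whose winning label has
      fewer votes are left unlabeled.
    """
    votes = defaultdict(Counter)
    for pixel_to_gaussians, pixel_labels in views:
        for pixel, gaussian_ids in pixel_to_gaussians.items():
            label = pixel_labels.get(pixel)
            if label is None:
                continue  # no 2D prediction for this pixel in this view
            for gid in gaussian_ids:
                votes[gid][label] += 1

    labels = {}
    for gid, counter in votes.items():
        # most_common(1) returns the label with the highest vote count;
        # ties are resolved in favor of the label seen first.
        label, count = counter.most_common(1)[0]
        if count >= min_votes:
            labels[gid] = label
    return labels
```

Because the output is a plain id-to-label table rather than a learned feature field, looking up a Gaussian's label at inference time is a dictionary access, which is the property the abstract attributes to the discrete representation.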

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The architecture of SAGOnline. The framework comprises an initialization stage (Left) and a real-time inference stage (Right). (A) We propose a Rasterization-aware Geometric Consensus (RAGC) module to resolve semantic ambiguities. By analyzing the pixel ray, RAGC identifies valid Gaussians within the surface crust (green zone) that are geometrically aligned with the pixel center, aggregating 2D semantics via majority voting. (B) To achieve real-time segmentation, the learned sparse semantic Gaussians are projected to form a coarse mask. This coarse prior is fused with the photorealistic RGB render in a Dual-Branch Refinement Network, employing an encoder-decoder structure with channel attention to recover fine-grained details, supervised by online distillation.
  • Figure 2: Qualitative segmentation results demonstrating multi-view consistency. We present target object extraction results for four diverse scenes: Fork, Fortress, Horns, and Truck. For each block, the input scene is shown in the 1st column, followed by the extracted binary masks rendered from three distinct viewpoints (Render 0-2). The results highlight our method's ability to maintain precise object boundaries and geometric coherence across varying camera poses.
  • Figure 3: Qualitative results of multi-object segmentation across diverse scenes. The figure is organized into four scene blocks: Desktop, Cars, Teatime, and Counter. For each block, the 1st column displays the reference scene; the 2nd column visualizes the explicit 3D masks formed by the Gaussian primitive means; and the subsequent columns (3rd-4th) present the segmentation masks rendered from different viewpoints. Consistent colors across views indicate that our method maintains robust instance identity.
  • Figure 4: Qualitative results on diverse segmentation tasks. The top row demonstrates automatic vehicle instance segmentation on the KITTI-360 dataset utilizing YOLO. The bottom row showcases open-vocabulary segmentation on the UDD dataset driven by SAM 3 with text prompts. These results highlight our framework's versatility, capable of leveraging distinct backbone models to handle various segmentation paradigms within 3D Gaussian Splatting scenes.
  • Figure 5: Visual comparison of the Dual-Branch Refinement Network. (a) Segmentation results derived directly from Sparse Semantic 3D Gaussians exhibit noticeable sparsity and noise artifacts. (b) After applying our refinement module, the segmentation masks become significantly denser, smoother, and spatially coherent, demonstrating the network's effectiveness in enhancing mask quality.
  • ...and 1 more figure
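
The coarse mask described in the Figure 1 (B) caption, formed by projecting sparse labeled Gaussians into the image, can be approximated with a simple point-projection sketch. This is an illustration under stated assumptions, not the authors' rasterizer: it projects only the Gaussian means with a hypothetical pinhole camera (`focal`, principal point at the image center), ignores each Gaussian's extent and opacity, and keeps the nearest point per pixel.

```python
import math


def project_coarse_mask(means, labels, focal, width, height):
    """Project labeled 3D points into a sparse 2D label image.

    means:  list of (x, y, z) points in camera coordinates, z pointing forward.
    labels: list of integer labels, one per point (0 is reserved for background).
    focal:  hypothetical pinhole focal length in pixels.
    Returns a height x width list of lists holding the nearest point's label.
    """
    mask = [[0] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # assumed principal point

    for (x, y, z), label in zip(means, labels):
        if z <= 0:
            continue  # point is behind the camera
        u = math.floor(focal * x / z + cx)  # column index
        v = math.floor(focal * y / z + cy)  # row index
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z  # simple z-buffer: nearest point wins
            mask[v][u] = label
    return mask
```

A mask produced this way is sparse and noisy, since most pixels receive no point at all, which matches the sparsity that the Figure 5 caption attributes to masks derived directly from the sparse semantic Gaussians before refinement.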