SAGOnline: Segment Any Gaussians Online
Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu, Yiping Chen, Dedong Zhang, Lingfei Ma, John S. Zelek, Jonathan Li
TL;DR
SAGOnline tackles the challenge of online, cross-view segmentation for 3D Gaussian Splatting scenes by decoupling the task into three lightweight sub-tasks: novel-view rendering, multi-view 2D video segmentation, and global 3D fusion. It introduces Rasterization-aware Geometric Consensus (RGC) to deterministically lift 2D predictions into discrete 3D labels using visibility and geometric alignment constraints, enabling immediate, feature-free 3D masks. A Stage II self-supervised refinement with a lightweight student network densifies and stabilizes the masks while preserving geometry, achieving 27 ms per-frame inference after refinement. On NVOS and SPIn-NeRF, SAGOnline sets new state-of-the-art mIoU scores (92.7% and 95.2%, respectively) with rapid, online performance, and demonstrates versatility across datasets and foundation models for instance, semantic, and prompt-based segmentation, making it well-suited for interactive AR/VR and robotics applications.
Abstract
3D Gaussian Splatting has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Existing segmentation approaches typically rely on high-dimensional feature lifting, which leads to costly optimization, implicit semantics, and task-specific constraints. We present Segment Any Gaussians Online (SAGOnline), a unified, zero-shot framework that achieves real-time, cross-view consistent segmentation without scene-specific training. SAGOnline decouples the monolithic segmentation problem into lightweight sub-tasks. By integrating video foundation models (e.g., SAM 2), we first generate temporally consistent 2D masks across rendered views. Crucially, instead of learning continuous feature fields, we introduce a Rasterization-aware Geometric Consensus (RGC) mechanism that leverages the traceability of the Gaussian rasterization pipeline. This allows us to deterministically map 2D predictions to explicit, discrete 3D primitive labels in real time. This discrete representation eliminates the memory and computational burden of feature distillation, enabling instant inference. Extensive evaluations on the NVOS and SPIn-NeRF benchmarks demonstrate that SAGOnline achieves state-of-the-art accuracy (92.7% and 95.2% mIoU, respectively) while running at the fastest speed, 27 ms per frame. By providing a flexible interface for diverse foundation models, our framework supports instant prompt-based, instance, and semantic segmentation, paving the way for interactive 3D understanding in AR/VR and robotics.
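To make the idea of lifting 2D masks to discrete per-Gaussian labels concrete, the following is a minimal sketch and not the authors' RGC implementation. It assumes the rasterizer can report, for every pixel, the indices of the Gaussians that contributed to it and their blending weights. Each Gaussian then takes the label with the largest accumulated weight across views. The function name `lift_masks_to_gaussians`, the array layout, and the `min_weight` threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def lift_masks_to_gaussians(contrib_ids, contrib_weights, masks,
                            num_gaussians, num_labels, min_weight=1e-3):
    """Weighted vote from 2D mask labels to per-Gaussian labels (illustrative).

    contrib_ids:     list of int arrays (H, W, K), Gaussian index per pixel
                     contributor, -1 where there is no contributor (assumed layout)
    contrib_weights: list of float arrays (H, W, K), blending weight per contributor
    masks:           list of int arrays (H, W), 2D label per pixel (0 = background)
    Returns an int array (num_gaussians,), -1 for Gaussians never observed.
    """
    votes = np.zeros((num_gaussians, num_labels), dtype=np.float64)
    for ids, weights, mask in zip(contrib_ids, contrib_weights, masks):
        # Give every contributor of a pixel that pixel's 2D label.
        pixel_labels = np.broadcast_to(mask[..., None], ids.shape)
        # Visibility filter: skip empty slots and negligible contributions.
        valid = (ids >= 0) & (weights >= min_weight)
        # Unbuffered accumulation so repeated Gaussian indices all count.
        np.add.at(votes, (ids[valid], pixel_labels[valid]), weights[valid])
    labels3d = votes.argmax(axis=1)
    labels3d[votes.sum(axis=1) == 0] = -1  # no view ever saw this Gaussian
    return labels3d

# Toy scene: 4 Gaussians, 2 views of 1x2 pixels, up to 2 contributors per pixel.
ids_a = np.array([[[0, 1], [1, 2]]])
w_a = np.array([[[0.9, 0.1], [0.2, 0.8]]])
ids_b = np.array([[[0, -1], [2, -1]]])
w_b = np.array([[[1.0, 0.0], [1.0, 0.0]]])
mask = np.array([[1, 0]])  # left pixel is foreground in both views

labels3d = lift_masks_to_gaussians([ids_a, ids_b], [w_a, w_b], [mask, mask],
                                   num_gaussians=4, num_labels=2)
print(labels3d.tolist())  # [1, 0, 0, -1]
```

Because the result is one integer per Gaussian rather than a learned feature vector, rendering a mask for a new view only requires rasterizing with these labels, which is the kind of saving over feature distillation the abstract describes.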
