MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen
TL;DR
MoonSeg3R is a monocular, online, zero-shot framework for 3D instance segmentation. It combines reconstructive priors from CUT3R, a Reconstructive Foundation Model (RFM), with 2D masks from visual foundation models (VFMs) to produce discriminative 3D queries. It adds self-supervised query refinement with spatial-semantic distillation, a 3D Query Index Memory for cross-frame consistency, and a state-distribution token for robust online mask fusion. The method achieves competitive performance on ScanNet200 and SceneNN without depth supervision, outperforming monocular baselines and approaching RGB-D methods with faster inference. The results highlight the value of combining geometric priors from RFMs with semantic cues from VFMs for real-time 3D perception; a noted limitation is that RFM errors accumulate over long sequences.
Abstract
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel and practical setting in which existing approaches cannot operate because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and that it achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
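To make the idea of mask-derived queries and cross-frame consistency more concrete, here is a minimal sketch. It is not the MoonSeg3R implementation. It assumes that each frame provides per-pixel feature vectors and a set of binary 2D masks, pools the features inside each mask into one query vector, and matches that query against a small memory by cosine similarity. The names `mask_pool`, `cosine`, `QueryMemory`, and the `threshold` value are hypothetical. The sketch omits self-supervised refinement, spatial-semantic distillation, and the state-distribution token entirely, and a running mean stands in for whatever update rule the actual memory uses.

```python
import math
from dataclasses import dataclass, field


def mask_pool(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class QueryMemory:
    """Hypothetical memory: one running-mean query vector per instance id."""
    threshold: float = 0.8  # assumed similarity cutoff, not from the paper
    queries: list = field(default_factory=list)
    counts: list = field(default_factory=list)

    def assign(self, query):
        """Return an instance id: merge into the best match or open a new slot."""
        best_id, best_sim = -1, -1.0
        for i, stored in enumerate(self.queries):
            sim = cosine(query, stored)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id >= 0 and best_sim >= self.threshold:
            n = self.counts[best_id]
            # Update the stored query as a running mean over all merged masks.
            self.queries[best_id] = [
                (n * old + new) / (n + 1)
                for old, new in zip(self.queries[best_id], query)
            ]
            self.counts[best_id] = n + 1
            return best_id
        self.queries.append(list(query))
        self.counts.append(1)
        return len(self.queries) - 1


# Toy example: frame 1 has two masks, frame 2 re-observes the first object.
memory = QueryMemory()
frame1 = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
masks1 = [[True, True, False, False], [False, False, True, True]]
ids_frame1 = [memory.assign(mask_pool(frame1, m)) for m in masks1]

frame2 = [[0.9, 0.0], [1.0, 0.2]]
id_frame2 = memory.assign(mask_pool(frame2, [True, True]))
print(ids_frame1, id_frame2)  # prints: [0, 1] 0
```

In this toy run the two frame-1 masks have nearly orthogonal pooled features and receive separate ids, while the frame-2 mask is similar enough to the first stored query to reuse its id. A real online system would also need geometric cues and a more robust fusion rule, which is the role the abstract assigns to the CUT3R priors and the state-distribution token.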
