MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen

TL;DR

MoonSeg3R introduces a monocular online zero-shot framework for 3D instance segmentation by leveraging reconstructive priors from CUT3R and VFM-derived 2D masks to produce discriminative 3D queries. It adds self-supervised query refinement with spatial-semantic distillation, a 3D Query Index Memory for cross-frame consistency, and a state distribution token to robustly fuse masks online. The method achieves competitive performance on ScanNet200 and SceneNN without depth supervision, outperforming monocular baselines and approaching RGB-D methods with faster inference. It highlights the value of combining geometric priors from RFMs with semantic cues from VFMs for real-time 3D perception, while noting limitations from accumulation of RFM errors over long sequences.

Abstract

In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.

Paper Structure

This paper contains 15 sections, 14 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Previous VFM-assisted Online Paradigm vs. Ours. While existing methods rely on ground-truth geometry (and 3D segmentation masks), our method works in a monocular online zero-shot setting, exploiting the spatio-temporal priors from an RFM to help with online 3D segmentation, thereby achieving online reconstruction and segmentation simultaneously.
  • Figure 2: Overview of MoonSeg3R. The pipeline consists of four steps. (a) CUT3R takes an uncalibrated image $\textbf{I}_t$ as input to predict explicit geometry (pose $\textbf{P}_t$, world-coordinate pointmap $\textbf{X}_t$) and implicit representations (geometric features $\textbf{F}^{3d}_t$, state attention $\textbf{A}_t$). (b) VFM masks $\textbf{M}_t$ are lifted and refined into 3D queries $\textbf{q}'_t$ through a transformer decoder, supervised by spatial-semantic self-distillation ($\mathcal{L}_{dist}$, $\mathcal{L}_{seg}$) (\ref{sec:feature_refinement}). (c) In parallel, we use $\textbf{P}_t$ to rasterize our explicit 3D query index memory $\mathcal{M}_{t-1}$, efficiently retrieving relevant historical queries from the query bank $\mathcal{Q}_{t-1}$ for contextual query injection into the query refinement process and for cross-frame supervision ($\mathcal{L}_{xseg}$). The memory and bank are then updated using $\textbf{X}_t$ and $\textbf{q}'_t$, respectively (\ref{sec:memory}). (d) During inference, we first merge over-segmented instances and then perform bipartite matching, using our novel state distribution token derived from state attention $\textbf{A}_t$ to make the association more robust (\ref{sec:merge}).
  • Figure 3: Qualitative Comparison. Qualitative examples of OnlineAnySeg-M and our method on ScanNet200 sequences. These results visually demonstrate that MoonSeg3R achieves superior instance segmentation. OnlineAnySeg-M, in contrast, tends to fail in associating masks, which leaves significant unsegmented areas, as highlighted by the red dashed circles. The segmentation results are unprojected onto the ground-truth point cloud for visualization.
  • Figure 4: Distilled Feature Visualization. Top row: The original images. Middle row: Reference features trained only with self-supervision. These features show a fixed spatial pattern (purple to yellow) that is not correlated with the actual spatial location. Bottom row: Features trained with spatial-semantic distillation. This strategy mitigates the feature degradation by preserving essential structural patterns from the foundation models. The resulting features are object-aware, as the bag area remains consistent across views, while the locker features properly reflect the 3D spatial variation compared to other views.
  • Figure 5: State Distribution Similarity. For two consecutive frames, we extract the state distribution tokens for all instances and compute their cross-frame pairwise similarities. Tokens belonging to the same instance always exhibit the highest similarity scores, both for large, fully visible objects (sofa) and small, partially observed objects (table).
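
The cross-frame association described in Figures 2(d) and 5 can be illustrated with a small sketch: compute pairwise similarities between per-instance tokens from two frames, then pick the one-to-one assignment with the highest total similarity. This is not the authors' implementation. The cosine similarity measure, the `min_similarity` threshold, the function names, and the brute-force search are all assumptions made for illustration; the toy tokens are invented values, not outputs of CUT3R.

```python
import math
from itertools import permutations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(tokens_prev, tokens_curr):
    """sim[i][j] compares instance i of the previous frame with instance j of the current one."""
    return [[cosine_similarity(p, c) for c in tokens_curr] for p in tokens_prev]


def match_instances(sim, min_similarity=0.5):
    """One-to-one matching that maximizes total similarity.

    Brute force over permutations, so only suitable for a handful of
    instances; a Hungarian-style solver would replace it at scale.
    Pairs below the hypothetical `min_similarity` threshold are dropped.
    """
    n_prev = len(sim)
    n_curr = len(sim[0]) if sim else 0
    if n_prev == 0 or n_curr == 0:
        return []
    if n_prev <= n_curr:
        # Assign each previous instance a distinct current instance.
        best = max(
            permutations(range(n_curr), n_prev),
            key=lambda perm: sum(sim[i][j] for i, j in enumerate(perm)),
        )
        pairs = list(enumerate(best))
    else:
        # Assign each current instance a distinct previous instance.
        best = max(
            permutations(range(n_prev), n_curr),
            key=lambda perm: sum(sim[i][j] for j, i in enumerate(perm)),
        )
        pairs = sorted((i, j) for j, i in enumerate(best))
    return [(i, j) for i, j in pairs if sim[i][j] >= min_similarity]


# Toy tokens: each row is a made-up distribution over three state slots.
tokens_prev = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
tokens_curr = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
matches = match_instances(similarity_matrix(tokens_prev, tokens_curr))
```

Here `matches` pairs previous instance 0 with current instance 1 and previous instance 1 with current instance 0, leaving the third current instance unmatched so it can be treated as a new instance.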