Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Pedro Hermosilla; Christian Stippel; Leon Sick

TL;DR

This work addresses the gap in 3D scene understanding where self-supervised representations are rarely usable off-the-shelf. It introduces a principled evaluation protocol that preserves hierarchical information from 3D encoders and decoders, enabling reliable linear probing and nearest-neighbor assessments. The main contribution is Masked Scene Modeling (MSM), a 3D-native self-supervised objective that reconstructs deep features of masked patches from a teacher, using bottom-up hierarchical reconstruction and cross-view objectives. Empirically, MSM delivers supervised-like performance across semantic segmentation, instance segmentation, and 3D visual grounding, outperforming existing 3D SSL methods and even rivaling 2D foundation models when applied to 3D tasks, especially in low-annotation regimes. The approach paves the way for robust, general-purpose 3D representations and highlights the value of 3D-native self-supervised learning with hierarchical, feature-level supervision.

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate not only that our method achieves performance competitive with supervised models, but also that it surpasses existing self-supervised approaches by a large margin. The model and training code can be found in our GitHub repository (https://github.com/phermosilla/msm).
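
To make the evaluation idea in the abstract concrete, the following is a minimal sketch of multi-resolution feature sampling followed by a nearest-neighbor label lookup. It is not the authors' implementation: the function names (`sample_multires_features`, `nearest_neighbor_label`), the data layout, and the choice of plain nearest-location lookup per hierarchy level are all assumptions made for illustration, since the abstract does not specify how features are interpolated across levels.

```python
import math


def nearest_index(query, candidates):
    """Return the index of the candidate closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))


def sample_multires_features(points, levels):
    """Build one descriptor per point by concatenating features from every level.

    points: list of (x, y, z) tuples at full resolution.
    levels: list of (positions, features) pairs, one per hierarchy level, where
            positions are that level's locations and features their vectors.
    Assumption (hypothetical): each point simply takes the feature stored at
    its nearest location in each level; a real model may interpolate instead.
    """
    descriptors = []
    for p in points:
        desc = []
        for positions, features in levels:
            desc.extend(features[nearest_index(p, positions)])
        descriptors.append(desc)
    return descriptors


def nearest_neighbor_label(descriptor, bank_descriptors, bank_labels):
    """1-NN evaluation: copy the label of the closest stored descriptor."""
    return bank_labels[nearest_index(descriptor, bank_descriptors)]
```

In this sketch, a point descriptor is as long as the sum of the feature widths over all levels, so a linear probe or nearest-neighbor classifier sees both fine and coarse context at once. Frozen features are used throughout; nothing here updates the encoder.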
