Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Pedro Hermosilla, Christian Stippel, Leon Sick

TL;DR

This work addresses the gap in 3D scene understanding where self-supervised representations are rarely usable off-the-shelf. It introduces a principled evaluation protocol that preserves hierarchical information from 3D encoders and decoders, enabling reliable linear probing and nearest-neighbor assessments. The main contribution is Masked Scene Modeling (MSM), a 3D-native self-supervised objective that reconstructs deep features of masked patches from a teacher, using bottom-up hierarchical reconstruction and cross-view objectives. Empirically, MSM delivers supervised-like performance across semantic segmentation, instance segmentation, and 3D visual grounding, outperforming existing 3D SSL methods and even rivaling 2D foundation models when applied to 3D tasks, especially in low-annotation regimes. The approach paves the way for robust, general-purpose 3D representations and highlights the value of 3D-native self-supervised learning with hierarchical, feature-level supervision.

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate not only that our method achieves performance competitive with supervised models, but also that it surpasses existing self-supervised approaches by a large margin. The model and training code can be found in our GitHub repository (https://github.com/phermosilla/msm).
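
The multi-resolution feature sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each hierarchy level stores one feature vector per occupied voxel, and that a point simply takes the feature of the voxel enclosing it at every level (nearest-voxel lookup) before the per-level features are concatenated. The names voxel_key and sample_hierarchical_features, the dictionary layout, and the toy voxel sizes are all hypothetical.

```python
from math import floor


def voxel_key(point, voxel_size):
    """Integer grid cell that contains `point` at the given resolution."""
    return tuple(floor(c / voxel_size) for c in point)


def sample_hierarchical_features(points, levels):
    """Concatenate, per point, the feature of its enclosing voxel at every level.

    `levels` is a list of (voxel_size, {voxel_key: feature_list}) pairs,
    ordered from fine to coarse. Nearest-voxel lookup is an assumption of
    this sketch; an interpolating sampler could be used instead.
    """
    sampled = []
    for p in points:
        feat = []
        for voxel_size, grid in levels:
            feat.extend(grid[voxel_key(p, voxel_size)])
        sampled.append(feat)
    return sampled


# Toy hierarchy (hypothetical values): a fine level with 0.5-unit voxels and
# 2 channels, and a coarse level with 1.0-unit voxels and 1 channel.
levels = [
    (0.5, {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [3.0, 4.0]}),
    (1.0, {(0, 0, 0): [9.0]}),
]
points = [(0.1, 0.2, 0.3), (0.7, 0.1, 0.4)]
features = sample_hierarchical_features(points, levels)
# Each point now carries 2 fine + 1 coarse = 3 channels.
```

In an evaluation along the lines the abstract describes, point-level vectors of this kind would be passed, with the backbone frozen, to a linear classifier or a nearest-neighbor lookup.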

Paper Structure

This paper contains 65 sections, 2 equations, 9 figures, 19 tables.

Figures (9)

  • Figure 1: Self-Supervised Feature Visualization using PCA. We reduce the point features obtained with our self-supervised model to three dimensions using PCA and visualize them as colors. Features learned by our model are semantic-aware, which is visible from the color separation: Similar objects result in similar features, such as the sofas in the first figure or the chairs in the last one, while different objects result in different features, such as the counter and the tables in the second image or the crib and the curtains in the third one.
  • Figure 2: Pilot study. Our hierarchical features uncover better performance in all self-supervised models. Moreover, our study shows that existing approaches exhibit a large performance gap between supervised and self-supervised training.
  • Figure 3: Hierarchical features
  • Figure 4: Overview. Our method receives as input a 3D scene represented as a point cloud (a). The scene is voxelized into two different views (b), which are then cropped and masked (c). The student model first encodes the cropped views and then adds the masked voxels with a learnable token (d). The decoder processes the cropped views and reconstructs deep features of the masked tokens (e). The loss is computed in a cross-view manner, where the target features (f) are obtained from a teacher model updated with EMA.
  • Figure 5: Hierarchical reconstruction. The masked voxelization is processed by our hierarchical encoder. The decoder processes the encoded features in a bottom-up manner by first including the masked voxels with a learnable token. Each level is used in the loss computation before the decoded features are upscaled and combined with the skip connection from the previous level.
  • ...and 4 more figures
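
The student/teacher setup in the Figure 4 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's training code. The EMA rule is the standard exponential moving average of weights. The cosine distance is only a stand-in, because this page does not state the actual loss, and the momentum value, the dictionary-of-lists weight layout, and the function names are hypothetical. In the cross-view setting, the student features would come from one masked view and the teacher targets from the other view.

```python
from math import sqrt


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the student: t <- m * t + (1 - m) * s."""
    return {
        name: [momentum * t + (1.0 - momentum) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def cosine_distance(a, b):
    """1 - cosine similarity; a stand-in for the unspecified feature loss."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def masked_feature_loss(student_feats, teacher_feats, masked):
    """Average distance between student predictions and teacher targets,
    counted only on voxels that were masked in the student's input."""
    losses = [cosine_distance(student_feats[v], teacher_feats[v]) for v in masked]
    return sum(losses) / len(losses)


# Toy values (hypothetical): two voxels with 2-channel features.
student_feats = {0: [1.0, 0.0], 1: [0.0, 1.0]}
teacher_feats = {0: [1.0, 0.0], 1: [1.0, 0.0]}
loss = masked_feature_loss(student_feats, teacher_feats, masked=[1])

# After a student optimization step, the teacher follows it slowly.
teacher = ema_update({"w": [0.0, 1.0]}, {"w": [1.0, 1.0]}, momentum=0.9)
```

For the hierarchical reconstruction of Figure 5, a loss of this kind would be evaluated at each decoder level, which the sketch leaves out for brevity.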