Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache

Abstract

Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception; bridging this gap is thus crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies: Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average a +3.5% mAP improvement for indoor instance segmentation and a +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.


Paper Structure

This paper contains 12 sections, 11 equations, 8 figures, 13 tables, and 2 algorithms.

Figures (8)

  • Figure 1: We achieve superior results over state-of-the-art self-supervised approaches (DOS abdelsamad2026dos and Sonata wu2025sonata) on both indoor instance (left) and outdoor panoptic segmentation (right).
  • Figure 2: Examples of an indoor and an outdoor scene, with K-means clustering over point features extracted from a self-supervised pre-trained model wu2025sonata.
  • Figure 3: Overview of PointINS: A point cloud is augmented into two independent views, which are randomly masked. The teacher processes the full input, while the student receives only visible points. Both networks share the same architecture. In the semantic branch, feature similarities are computed with prototypes $\mathcal{Q}$, and a KL-divergence loss $L_{\text{sem}}$ is applied for distillation. In the offset branch, an offset head maps features into 3D offset vectors. Teacher offsets are first regularized by ODR to align with empirically observed geometric priors. Next, segments obtained from K-means clustering are used to extract pseudo-instance masks. These masks enhance instance awareness by regularizing the local coherence of points. Finally, an offset loss $L_{\text{off}}$ is computed as the second distillation signal.
  • Figure 4: Offset distributions of ScanNet dai2017scannet and nuScenes fong2022panoptic
  • Figure 5: We monitor the linear probing (LP) performance of a model pre-trained with a recent SSL method wu2025sonata. Remarkably, the model reaches 85% of its final LP performance within just 10% of the total training.
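The two distillation signals described in the Figure 3 caption can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: function names, temperatures, and toy shapes are assumptions, and the SCR step is approximated by snapping teacher offsets to per-pseudo-instance means before the offset loss.

```python
# Hypothetical sketch of the two PointINS distillation losses (not the authors' code).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_kl_loss(student_feat, teacher_feat, prototypes, tau_s=0.1, tau_t=0.05):
    """L_sem: KL divergence between prototype-similarity distributions
    of teacher (targets) and student (predictions). Temperatures assumed."""
    p_t = softmax(teacher_feat @ prototypes.T / tau_t)
    p_s = softmax(student_feat @ prototypes.T / tau_s)
    return float(np.mean(np.sum(p_t * (np.log(p_t + 1e-8) - np.log(p_s + 1e-8)), axis=-1)))

def offset_loss(student_off, teacher_off, pseudo_masks):
    """L_off: L2 distillation on 3D offsets. As a stand-in for SCR, teacher
    offsets are averaged within each pseudo-instance mask for local coherence."""
    reg = teacher_off.copy()
    for m in np.unique(pseudo_masks):
        idx = pseudo_masks == m
        reg[idx] = teacher_off[idx].mean(axis=0)  # coherent offset per pseudo instance
    return float(np.mean(np.linalg.norm(student_off - reg, axis=-1)))

# Toy usage: 6 points, 4-dim features, 3 prototypes, 3 pseudo instances.
rng = np.random.default_rng(0)
feat_s, feat_t = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
prototypes = rng.normal(size=(3, 4))
off_s, off_t = rng.normal(size=(6, 3)), rng.normal(size=(6, 3))
masks = np.array([0, 0, 1, 1, 2, 2])

l_sem = semantic_kl_loss(feat_s, feat_t, prototypes)
l_off = offset_loss(off_s, off_t, masks)
```

Both terms are scalars and would be summed (possibly with weights) into the overall pre-training objective; the exact weighting is not specified here.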