Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange

Yanhao Wu, Tong Zhang, Wei Ke, Congpei Qiu, Sabine Susstrunk, Mathieu Salzmann

TL;DR

Indoor point-cloud SSL suffers from strong inter-object dependencies driven by human layouts. The authors propose OESSL, combining an object-exchange strategy that swaps similarly sized objects across scenes with a context-aware feature learning scheme guided by a joint objective $L_{total}=L_{context}+\lambda L_{op}+\gamma L_{aux}$ (with $\lambda=1$ and $\gamma=2$). The method performs context-aware learning through two losses that align exchanged clusters (object patterns) and remaining clusters (context), plus an auxiliary relocation task to regularize relocation-aware features. Evaluations on ScanNet, S3DIS, and Synthia4D show consistent gains over prior SSL methods and strong transferability across datasets, highlighting improved robustness to contextual changes and better generalization for indoor scene understanding.
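The weighting in the joint objective above can be illustrated with a few lines of Python. This is only a sketch: the function name `total_loss` and the numeric loss values are hypothetical, and the three terms are treated as already-computed scalars rather than the authors' actual loss implementations.

```python
def total_loss(l_context: float, l_op: float, l_aux: float,
               lam: float = 1.0, gamma: float = 2.0) -> float:
    """Combine the three terms as L_total = L_context + lam * L_op + gamma * L_aux.

    The defaults lam=1 and gamma=2 are the values quoted in the summary.
    """
    return l_context + lam * l_op + gamma * l_aux


# Hypothetical per-batch values, chosen only to show the weighting.
example = total_loss(l_context=0.5, l_op=0.25, l_aux=0.125)
print(example)  # 0.5 + 1 * 0.25 + 2 * 0.125 = 1.0
```

With $\gamma=2$, the auxiliary relocation term contributes twice its raw value, so it weighs as much here as an object-pattern term twice its size.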

Abstract

In the realm of point cloud scene understanding, particularly in indoor scenes, objects are arranged following human habits, resulting in objects of certain semantics being closely positioned and displaying notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies, bypassing the individual object patterns. To address this challenge, we introduce a novel self-supervised learning (SSL) strategy. Our approach leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy, where pairs of objects with comparable sizes are exchanged across different scenes, effectively disentangling the strong contextual dependencies. Subsequently, we introduce a context-aware feature learning strategy, which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques, further showing its better robustness to environmental changes. Moreover, we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets.
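The object-exchanging strategy described in the abstract could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses axis-aligned bounding boxes as a stand-in for the boxes used to compare cluster sizes, a greedy matching rule, and a hypothetical per-axis tolerance `tol`. All function names are invented for this example, and clusters are assumed to be given already (the clustering step is not shown).

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Clusters = Dict[int, List[Point]]  # cluster id -> points of that cluster


def box_extent(points: List[Point]) -> Point:
    """Size (dx, dy, dz) of the axis-aligned bounding box of a cluster."""
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def box_center(points: List[Point]) -> Point:
    """Center of the axis-aligned bounding box of a cluster."""
    xs, ys, zs = zip(*points)
    return ((max(xs) + min(xs)) / 2, (max(ys) + min(ys)) / 2, (max(zs) + min(zs)) / 2)


def match_clusters(scene_m: Clusters, scene_n: Clusters, tol: float) -> List[Tuple[int, int]]:
    """Greedily pair clusters whose box sizes differ by at most `tol` on every axis."""
    pairs, used = [], set()
    for i, pts_m in scene_m.items():
        size_m = box_extent(pts_m)
        for j, pts_n in scene_n.items():
            if j in used:
                continue
            if all(abs(a - b) <= tol for a, b in zip(size_m, box_extent(pts_n))):
                pairs.append((i, j))
                used.add(j)
                break
    return pairs


def move_to(points: List[Point], target: Point) -> List[Point]:
    """Translate a cluster so that its box center lands on `target`."""
    shift = tuple(t - c for t, c in zip(target, box_center(points)))
    return [tuple(p + s for p, s in zip(pt, shift)) for pt in points]


def exchange(scene_m: Clusters, scene_n: Clusters, tol: float = 0.5) -> Tuple[Clusters, Clusters]:
    """Swap matched clusters across two scenes, placing each at its partner's location."""
    new_m, new_n = dict(scene_m), dict(scene_n)
    for i, j in match_clusters(scene_m, scene_n, tol):
        new_m[i] = move_to(scene_n[j], box_center(scene_m[i]))
        new_n[j] = move_to(scene_m[i], box_center(scene_n[j]))
    return new_m, new_n


# Toy scenes: only cluster 0 of each scene has a comparable size.
scene_m = {0: [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], 1: [(5.0, 5.0, 0.0), (9.0, 9.0, 3.0)]}
scene_n = {0: [(10.0, 10.0, 0.0), (11.0, 11.0, 1.25)], 1: [(20.0, 0.0, 0.0), (20.5, 0.5, 0.5)]}
new_m, new_n = exchange(scene_m, scene_n)
```

In this toy case the object from the second scene ends up at the location of the first scene's object and vice versa, while the unmatched clusters (the context) are left untouched.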

Paper Structure (12 sections, 6 equations, 8 figures, 13 tables)


Figures (8)

  • Figure 1: (a) Visualization of semantic segmentation for edited scenes. We relocate objects to places where they appear less frequently. Our pre-trained model segments the relocated object accurately, while the pre-trained model from MSC labels the objects incorrectly. (b) Bar chart depicting the semantic segmentation performance on ScanNet with varying ratios of rearranged objects. The X-axis indicates the ratio of rearranged objects for each scene, and the Y-axis shows the mean Intersection over Union (mIoU) scores. The models are pre-trained and fine-tuned on ScanNet with 10% labels. We compare OESSL (Ours) with MSC, DepthContrast, and training from scratch (weights are randomly initialized).
  • Figure 2: Overview of our OESSL. A. Given two randomly selected point clouds $P^{m}$ and $P^{n}$, we first perform clustering and generate minimum circumscribed boxes for every cluster. Clusters with similar circumscribed boxes are matched as cluster pairs. We exchange points of matched clusters and apply augmentation on $P^{m}$ and $P^{n}$ to generate novel views $\hat{P}^{m}$, $\hat{P}^{n}$, alongside two augmented views $\Bar{P}^{m}$ and $\Bar{P}^{n}$ without exchange. B. Every scene is passed through a feature extractor (Backbone) to obtain point-wise and cluster-wise features. C. We minimize the cluster feature distance obtained from the exchanged clusters in the different scenes (i.e., $\Bar{P}^{m}$ and $\hat{P}^{n}$, $\Bar{P}^{n}$ and $\hat{P}^{m}$). D. We maximize the feature similarity between the remaining clusters in the augmented scenes (i.e., $\Bar{P}^{m}$ and $\hat{P}^{m}$, $\Bar{P}^{n}$ and $\hat{P}^{n}$). E. The point-wise features are passed through a multilayer perceptron (MLP) to classify which points belong to the relocated objects. The cross-entropy loss is used for classification. $\tau_{1}$ and $\tau_{2}$ are data augmentations, such as random flipping and random clipping.
  • Figure 3: Segmentation results in scenes with objects relocated to unusual locations to eliminate contextual cues. We compare MSC, OESSL (Ours), and training from scratch (without pre-training). The model pre-trained with our method better distinguishes the relocated objects, as shown in the highlighted area (colored circles).
  • Figure 4: Comparison of mIoU on ScanNet, after fine-tuning the models pre-trained with different $\beta$.
  • Figure 5: Affinity maps for the semantic classes in ScanNet. Top: affinity map for the training set. Bottom: affinity map for the training set after object exchange.
  • ...and 3 more figures
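
Steps B to D of the Figure 2 caption (cluster-wise features, then aligning matched clusters across views) might be sketched as below. This is an assumption-laden illustration: the caption does not specify how cluster features are pooled or which distance is used, so mean pooling and cosine distance are stand-ins, and the names `cluster_mean`, `cosine_distance`, and `alignment_loss` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cluster_mean(features: List[List[float]], labels: List[int]) -> Dict[int, List[float]]:
    """Average point-wise features per cluster id (assumed pooling: mean)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feat, cid in zip(features, labels):
        acc = sums.setdefault(cid, [0.0] * len(feat))
        for k, value in enumerate(feat):
            acc[k] += value
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: [v / counts[cid] for v in acc] for cid, acc in sums.items()}


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def alignment_loss(view_a: Dict[int, List[float]],
                   view_b: Dict[int, List[float]],
                   pairs: List[Tuple[int, int]]) -> float:
    """Mean cosine distance over matched cluster pairs (i in view_a, j in view_b)."""
    return sum(cosine_distance(view_a[i], view_b[j]) for i, j in pairs) / len(pairs)


# Toy point-wise features for two views; cluster ids play the role of matched objects.
view_a = cluster_mean([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1])
view_b = cluster_mean([[4.0, 0.0], [2.0, 0.0]], [0, 1])
loss_same_object = alignment_loss(view_a, view_b, [(0, 0)])   # parallel features
loss_mismatched = alignment_loss(view_a, view_b, [(1, 1)])    # orthogonal features
```

Under these assumptions, the same routine serves both step C (pairs are exchanged clusters seen in different scenes) and step D (pairs are remaining clusters seen in two views of the same scene); only the list of pairs changes.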