Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds

Shaofeng Zhang, Xuanqi Chen, Xiangdong Zhang, Sitong Wu, Junchi Yan

TL;DR

The paper targets the limitation of generative MAE-based SSL for 3D point clouds in learning high-level discriminative features, proposing a contrastive-only approach tailored to 3D geometry. CSCon introduces a dual-branch center-surrounding masking scheme and a novel inner-instance patch-level contrastive loss, all operating without a decoder and with shared encoder parameters. Extensive experiments on ShapeNet and downstream tasks (ScanObjectNN, ModelNet40, ShapeNetPart, S3DIS) show CSCon achieving state-of-the-art results under several protocols, with notable gains over baselines like Point-MAE. Ablation studies substantiate the importance of center-surrounding positives, inner-instance loss, parameter sharing, and masking strategies, demonstrating CSCon’s effectiveness in capturing both global and local 3D structure.

Abstract

Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches. In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN, respectively. The code will be made publicly available.
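
The abstract does not spell out the form of the patch-level contrastive loss. As a rough illustration only, the sketch below uses a generic InfoNCE objective, assuming that the two branch embeddings of the same patch form the positive pair and that the other patches of the same point cloud act as negatives. The function names, the temperature value, and this pairing are assumptions, not the authors' formulation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def patch_info_nce(center_feats, surround_feats, temperature=0.1):
    """Mean InfoNCE loss over the patches of a single point cloud.

    center_feats[i] and surround_feats[i] are embeddings of the same patch
    from the two branches (assumed positive pair); all other patches of the
    same instance are treated as negatives (also an assumption).
    """
    a = [l2_normalize(v) for v in center_feats]
    b = [l2_normalize(v) for v in surround_feats]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of patch i to every patch in the other branch.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)
```

With this objective, the loss is small when each patch embedding in one branch is closest to its counterpart in the other branch, and large when the correspondence is broken.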

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Illustration of the framework of the proposed CSCon. Both the transformer blocks and the projector share parameters across the two branches. Note that the point patches $\mathbf{P}$ contain only local information; the center positions (absolute xyz coordinates of the center points) are discarded.
  • Figure 2: Visualization on the ShapeNet validation set. The leftmost column shows the ground truth, while the subsequent columns present dual-branch inputs under different masking ratios: the left side displays the masked surrounding points, and the right side shows the masked center points. Note that when the center is masked, the patch loses its global coordinates and retains only the local features from the normalized surrounding points. Therefore, for patches with a masked center, we assign randomly generated center coordinates composed of noise for better visualization.
  • Figure 3: t-SNE (van der Maaten and Hinton, 2008) feature visualization on the ScanObjectNN dataset, where the features extracted by our CSCon are more concrete and discriminative than those of previous methods.
  • Figure 4: Performance evaluation of different masking ratios on the PB-T50-RS dataset.
  • Figure 5: Performance of different pre-training augmentations on the ScanObjectNN dataset.
  • ...and 1 more figure
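
The captions of Figures 1 and 2 describe each patch as a center position plus normalized local points, with one branch hiding centers and the other hiding surrounding points. A minimal sketch of how such dual-branch inputs could be built follows. It assumes patches are already grouped, uses the centroid as the patch center, and draws an independent random subset of patches to mask in each branch. The function name `build_dual_branch_inputs`, the dictionary layout, and the sampling scheme are hypothetical and not taken from the paper.

```python
import random

def build_dual_branch_inputs(patches, mask_ratio=0.6, seed=0):
    """Return (center_masked, surround_masked) views of a list of patches.

    patches: list of patches, each a list of (x, y, z) points.
    Each view is a list of dicts with keys "center" and "local";
    a masked component is stored as None.
    """
    rng = random.Random(seed)
    n_masked = int(len(patches) * mask_ratio)
    # Independent random patch subsets per branch (an assumption).
    center_hidden = set(rng.sample(range(len(patches)), n_masked))
    surround_hidden = set(rng.sample(range(len(patches)), n_masked))

    center_masked, surround_masked = [], []
    for i, pts in enumerate(patches):
        # Patch center taken as the centroid (assumption for this sketch).
        center = tuple(sum(p[k] for p in pts) / len(pts) for k in range(3))
        # Local geometry: points expressed relative to the patch center.
        local = [tuple(p[k] - center[k] for k in range(3)) for p in pts]
        center_masked.append(
            {"center": None if i in center_hidden else center, "local": local}
        )
        surround_masked.append(
            {"center": center, "local": None if i in surround_hidden else local}
        )
    return center_masked, surround_masked
```

In the center-masked view a hidden patch keeps only its normalized local points, which matches the caption's remark that such a patch loses its global coordinates; in the surrounding-masked view a hidden patch keeps only its center.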