Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers

Longkun Zou, Wanru Zhu, Ke Chen, Lihua Guo, Kailing Guo, Kui Jia, Yaowei Wang

TL;DR

This paper tackles unsupervised domain adaptation for 3D point cloud classification by distilling relational priors from a pretrained 2D transformer into a 3D model. It introduces a parameter-frozen, shared Transformer module for online cross-modal distillation, complemented by a self-supervised masked reconstruction task that fuses masked point patches with masked multi-view image features. The approach yields state-of-the-art results on PointDA-10 and Sim-to-Real benchmarks and demonstrates that leveraging 2D relational priors can substantially improve cross-domain generalization for 3D data. The proposed framework offers a practical, scalable pathway to bridge 2D and 3D modalities without requiring massive 3D pretraining data.

Abstract

The semantic pattern of an object point cloud is determined by the topological configuration of its local geometries. Learning discriminative representations can be challenging due to large shape variations of point sets in local regions and incomplete surfaces from a global perspective, both of which become even more severe in the context of unsupervised domain adaptation (UDA). Specifically, traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries, which greatly limits their cross-domain generalization. Recently, transformer-based models have achieved impressive performance gains in a range of image-based tasks, benefiting from their strong generalization capability and scalability, which stem from capturing long-range correlations across local patches. Inspired by these successes of visual transformers, we propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well trained on massive image collections, which can significantly empower cross-domain representations with consistent topological priors of objects. To this end, we establish a parameter-frozen pre-trained transformer module shared between the 2D teacher and 3D student models, complemented by an online knowledge distillation strategy for semantically regularizing the 3D student model. Furthermore, we introduce a novel self-supervised task centered on reconstructing masked point cloud patches using corresponding masked multi-view image features, thereby enabling the model to incorporate 3D geometric information. Experiments on the PointDA-10 and Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art UDA performance for point cloud classification. The source code of this work is available at https://github.com/zou-longkun/RPD.git.
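
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what a teacher-to-student distillation term can look like, assuming a generic temperature-softened KL divergence between class logits. The function names, the temperature value, and the batch and class sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    Generic distillation term (assumption); the paper's exact loss may differ.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    eps = 1e-12  # avoids log(0)
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl.mean()) * temperature ** 2

# Hypothetical logits: a batch of 4 samples over 10 classes.
rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))  # stands in for the 2D teacher output
student_logits = rng.standard_normal((4, 10))  # stands in for the 3D student output

loss_different = distillation_loss(student_logits, teacher_logits)
loss_identical = distillation_loss(teacher_logits, teacher_logits)
```

In an online setting, a term of this kind would be computed on every training batch and added to the student's other losses, so that the student is pulled toward the teacher's predictions while both are being trained.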

Paper Structure (23 sections, 14 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of the proposed relational priors distillation (RPD) method. We leverage the relational priors of a pretrained 2D Transformer model to boost the 3D Transformer encoder by sharing a parameter-frozen pretrained Transformer module and employing an online knowledge distillation strategy as semantic regularization for the 3D student model. An ensemble of the knowledge from the two modalities can effectively improve the generalization of point cloud representations to close the domain gap.
  • Figure 2: Overview of our proposed relational priors distillation framework, which adheres to a standard teacher-student distillation workflow. Both the 2D teacher model and the 3D student model include a Patchify step, a Tokenizer, and several Transformer encoder layers. For the 2D teacher model, we project the point cloud into 10 single-channel depth maps via the Realistic Projection Pipeline introduced by PointCLIP V2, and then "patchify" these depth maps into $10 \times 14 \times 14$ image patches as input to the 2D Tokenizer (i.e., Conv2D). Tokens from the 2D Tokenizer and a [CLS] token are fed into the Transformer encoder. For the 3D student model, we "patchify" the point cloud into 27 groups via Farthest Point Sampling (FPS) as input to the 3D Tokenizer (i.e., DGCNN). Tokens from the 3D Tokenizer and a [CLS] token are fed into the Transformer encoder. The two modalities are processed independently by a siamese Transformer encoder parametrized by an MAE pre-trained ViT. During training, we randomly mask pairs of point cloud token features and image token features with a high masking ratio of 0.85. The decoder consists of a sequence of multi-head cross-attention (MCA) and multi-head self-attention (MSA) layers and predicts the missing patches in the point cloud from the unmasked image token features. PE denotes the position encoding. Gray boxes indicate frozen parameters, while blue, green, and orange boxes indicate parameters that can be updated. (Best viewed in color.)
  • Figure 3: Confusion matrices for classifying test samples on the target domain under four simulation-to-reality scenarios: M$\rightarrow$S*, S$\rightarrow$S*, M11$\rightarrow$SO*11, and S9$\rightarrow$SO*9.
  • Figure 4: Illustration of cross-modal knowledge fusion.
  • Figure 5: Visualization of reconstructed point cloud samples with random masking in the target domain of PointDA-10.
  • ...and 1 more figures
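
The 3D "patchify" step and the random masking described in the Figure 2 caption can be sketched roughly as follows. This is a NumPy illustration under stated assumptions: the caption gives 27 FPS groups, $10 \times 14 \times 14$ image patches, and a masking ratio of 0.85, while the input size of 1024 points, the group size of 32, the nearest-neighbour grouping, and masking over the flattened image token set are hypothetical choices made here for the example.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, seed=0):
    """Greedily pick num_centers indices from an (N, 3) array."""
    rng = np.random.default_rng(seed)
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(points.shape[0])
    # Distance from every point to its nearest already-chosen centre.
    min_dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, num_centers):
        chosen[i] = int(np.argmax(min_dist))
        dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return chosen

def group_points(points, center_idx, group_size):
    """Gather the group_size nearest points of each centre: (G, K, 3).

    Nearest-neighbour grouping is an assumption; the caption only names FPS.
    """
    centers = points[center_idx]
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=2)
    knn = np.argsort(d, axis=1)[:, :group_size]
    return points[knn]

def random_mask(num_tokens, mask_ratio, seed=0):
    """Boolean mask with round(num_tokens * mask_ratio) entries set to True."""
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

# Hypothetical input: 1024 random points standing in for one object.
points = np.random.default_rng(0).standard_normal((1024, 3))
centers = farthest_point_sampling(points, num_centers=27)  # 27 groups, per the caption
groups = group_points(points, centers, group_size=32)      # group size is assumed

point_mask = random_mask(27, 0.85)                    # masks 23 of 27 point tokens
image_mask = random_mask(10 * 14 * 14, 0.85, seed=1)  # masks 1666 of 1960 image tokens
```

At this ratio only 4 of the 27 point tokens stay visible, which shows how little point geometry the decoder sees directly and why the unmasked image token features matter for the reconstruction.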