Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels

Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

TL;DR

Cross-MoST tackles the challenge of learning open-vocabulary 3D classification without labels by jointly training image and point-cloud encoders in a shared embedding space. It introduces a teacher-student framework that generates joint pseudo-labels from unlabeled 3D data and their 2D views, and uses cross-modal feature alignment together with masked modeling to regularize and enrich the representations. The method shows substantial gains over zero-shot inference and single-modality self-training across eight synthetic and real-world datasets, indicating robust cross-modal knowledge transfer between images and 3D point clouds. This label-free framework leverages CLIP-style priors and the complementary strengths of the two modalities to enable practical, scalable 3D classification in real-world settings, with potential for further improvement from stronger 2D priors and richer pretraining.

Abstract

Large-scale 2D vision-language models such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of the resulting 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework, Cross-MoST (Cross-Modal Self-Training), to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework that simultaneously processes 2D views and 3D point clouds and generates joint pseudo-labels to train a classifier and guide cross-modal feature alignment. We thereby demonstrate that 2D vision-language models such as CLIP can complement 3D representation learning and improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange, with the image and point-cloud modalities each learning from the other's rich representations.
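As a rough illustration of what a joint pseudo-label could look like, the sketch below picks, for each unlabeled sample, the class predicted by whichever teacher branch (image or point cloud) is more confident. This selection rule is an assumption made for illustration; the abstract does not specify how the two branches are combined. The function names `softmax` and `joint_pseudo_labels` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_pseudo_labels(image_logits, point_logits):
    """Pick one pseudo-label per sample from the more confident branch.

    ASSUMPTION: "more confident" means the larger maximum softmax
    probability. Returns a list of (label, confidence, source) tuples.
    """
    results = []
    for img, pts in zip(image_logits, point_logits):
        p_img, p_pts = softmax(img), softmax(pts)
        c_img, c_pts = max(p_img), max(p_pts)
        if c_img >= c_pts:
            results.append((p_img.index(c_img), c_img, "image"))
        else:
            results.append((p_pts.index(c_pts), c_pts, "point"))
    return results


# Toy teacher outputs for two samples and three classes (made-up numbers).
image_logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.1]]
point_logits = [[0.4, 0.6, 0.2], [0.1, 3.0, 0.2]]
labels = joint_pseudo_labels(image_logits, point_logits)
```

In this toy case the first sample takes its label from the image branch and the second from the point-cloud branch, so both student branches would be trained against the same target for a given sample.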

Paper Structure (23 sections, 14 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: The proposed cross-modal self-training achieves significant performance gains over zero-shot ULIP 3D classification, as well as over the recently proposed self-training method MUST applied to point clouds.
  • Figure 2: Cross-modal self-training for 3D point clouds and their corresponding 2D views. The teacher (blue) weights are updated as an exponential moving average of the student (green) weights. The teacher generates joint pseudo-labels to enable cross-modal self-training. The MPM and MIM modules inside the student model implement masked point modeling and masked image modeling, respectively. Align denotes the cross-modal feature alignment, whereas GL-Align within the MIM and MPM modules denotes global-local feature alignment, which supports masked modeling within each individual modality (image and point cloud).
  • Figure 3: As training progresses, bias towards certain classes is significantly reduced in both branches. Predictions on each branch become sharper, as indicated by increasing entropy. (ModelNet40)
  • Figure 4: The percentage of pseudo-labels selected from each modality for combined self-training. The agreement between pseudo-labels increases as training progresses. (ModelNet40)
  • Figure 5: Qualitative comparison of datasets.
  • ...and 4 more figures
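
The Figure 2 caption states that the teacher weights are an exponential moving average (EMA) of the student weights. A minimal sketch of such an update is shown below, using plain Python dictionaries in place of real network parameters. The function name `ema_update`, the `momentum` parameter, and the toy values are assumptions for illustration and are not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return new teacher weights as an EMA of the student weights.

    new_teacher = momentum * teacher + (1 - momentum) * student
    ASSUMPTION: weights are stored as {name: float}; a real model
    would apply the same rule to every parameter tensor.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy weights (made-up numbers); momentum is lowered so the effect is visible.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed version of the student, which is the usual reason for generating pseudo-labels from the teacher instead of the student itself.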