CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson W. H. Lau, Wanli Ouyang, Wangmeng Zuo

TL;DR

CLIP2Point tackles the scarcity of large-scale vision-language pre-training for 3D by transferring CLIP through an image-depth pre-training scheme that learns a depth encoder aligned to CLIP visuals. It introduces cross-modality and intra-modality contrastive losses, a novel depth rendering strategy, and a Gated Dual-Path Adapter (GDPA) to adapt to downstream tasks. The approach, evaluated on ShapeNet-derived pre-training and zero-shot/few-shot benchmarks like ModelNet and ScanObjectNN, achieves state-of-the-art results in multiple settings and demonstrates effective cross-modal transfer for 3D point cloud classification. Overall, CLIP2Point provides a practical pathway to leverage CLIP in open-world 3D understanding with efficient downstream adaptation.

Abstract

Pre-training across 3D vision and language remains under development because of limited training data. Recent works attempt to transfer vision-language pre-training models to 3D vision. PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification. However, its performance is restricted by the domain gap between rendered depth maps and images, as well as the diversity of depth distributions. To address this issue, we propose CLIP2Point, an image-depth pre-training method by contrastive learning to transfer CLIP to the 3D domain, and adapt it to point cloud classification. We introduce a new depth rendering setting that forms a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning to enforce the depth features for capturing expressive visual and textual features and intra-modality learning to enhance the invariance of depth aggregation. Additionally, we propose a novel Dual-Path Adapter (DPA) module, i.e., a dual-path structure with simplified adapters for few-shot learning. The dual-path structure allows the joint use of CLIP and CLIP2Point, and the simplified adapter can well fit few-shot tasks without post-search. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms PointCLIP and other self-supervised 3D networks, achieving state-of-the-art results on zero-shot and few-shot classification.
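
The zero-shot pipeline described in the abstract (multi-view depth maps encoded into features and compared against textual prompts) can be illustrated with a minimal sketch. It assumes the per-view depth features and per-class text features have already been extracted by some encoder. The function names, the optional `view_weights` parameter, and the weighted-sum aggregation over views are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def zero_shot_classify(view_features, text_features, view_weights=None):
    """Pick the class whose text feature best matches the depth-view features.

    view_features: list of feature vectors, one per rendered depth view.
    text_features: dict mapping class name -> text feature vector.
    view_weights:  hypothetical per-view weights (uniform if omitted).
    """
    if view_weights is None:
        view_weights = [1.0] * len(view_features)
    scores = {}
    for name, text_vec in text_features.items():
        # Assumed aggregation: weighted sum of per-view similarities.
        scores[name] = sum(
            w * cosine(feat, text_vec)
            for w, feat in zip(view_weights, view_features)
        )
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy 2-D features standing in for real encoder outputs.
    views = [[1.0, 0.0], [0.9, 0.1]]
    prompts = {"chair": [1.0, 0.0], "lamp": [0.0, 1.0]}
    print(zero_shot_classify(views, prompts))
```

In this sketch no parameters are trained: the prediction depends only on how well the depth features align with the text features, which is why a domain gap between depth maps and natural images directly limits accuracy.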

Paper Structure (28 sections, 13 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Overall architecture of CLIP transfer learning on the 3D domain. Point clouds are first projected to multi-view depth maps, which are then aggregated by the CLIP visual encoder. Comparing the result with textual prompts yields the classification prediction. However, we argue that a domain gap exists between depth maps and CLIP pre-training images. To this end, a depth encoder pre-trained via CLIP2Point is proposed.
  • Figure 2: Pre-training scheme of CLIP2Point. We propose a self-supervised pre-training scheme with intra-modality and cross-modality contrastive learning to align depth features with CLIP visual features. We randomly choose a camera view for each 3D model and modify the distances of the view to construct a pair of rendered depth maps. We adopt one NT-Xent loss between pairs of depth features extracted from the depth encoder and another between image features and average depth features. We freeze the image encoder during training, so that the depth features from the depth encoder are forced to align with the image features from the CLIP visual encoder. Additionally, instead of all the blue points, we only consider the red point during depth rendering, which improves the visual effect.
  • Figure 3: Gated Dual-Path Adapter (GDPA) for downstream learning. We design a dual-path structure, combining our pre-trained depth encoder with CLIP visual encoder. We propose a global-view aggregator and attach it to each encoder, which is parameter-efficient for downstream training. GDPA allows a fusion of knowledge in CLIP and our pre-training, enhancing the adaptation ability of CLIP2Point.
  • Figure 4: Visualization results of our rendered images with different rendering settings.
  • Figure 5: Visualization of feature distributions on ModelNet10.
  • ...and 4 more figures
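
The two contrastive terms described in the Figure 2 caption can be sketched as follows: one NT-Xent loss between the two depth features of each 3D model (intra-modality), and another between the image feature and the average of the two depth features (cross-modality). This is a plain-Python illustration under stated assumptions. The temperature value, the `cross_weight` balancing factor, and all function names are hypothetical and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent(za, zb, temperature=0.5):
    """NT-Xent loss over N positive pairs (za[i], zb[i]).

    Every other sample in the combined batch of 2N features is a negative.
    """
    n = len(za)
    z = [_normalize(v) for v in list(za) + list(zb)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of sample i
        logits = [_dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = _dot(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp over all non-self similarities.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


def pretraining_loss(depth_a, depth_b, image, temperature=0.5, cross_weight=1.0):
    """Intra-modality plus cross-modality NT-Xent terms.

    depth_a, depth_b: depth features of the same view rendered at two distances.
    image:            image features from the (frozen) image encoder.
    cross_weight:     hypothetical factor balancing the two terms.
    """
    intra = nt_xent(depth_a, depth_b, temperature)
    # Cross-modality term uses the average of the two depth features.
    mean_depth = [
        [(x + y) / 2.0 for x, y in zip(a, b)] for a, b in zip(depth_a, depth_b)
    ]
    cross = nt_xent(mean_depth, image, temperature)
    return intra + cross_weight * cross
```

In an actual training loop the image features would be computed without gradients, since the caption states that the image encoder is frozen; only the depth encoder would then be updated by this loss.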