EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder

Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, Wanli Ouyang

TL;DR

EPCL presents a data- and compute-efficient strategy to repurpose a frozen CLIP transformer as a 3D point cloud encoder. By introducing a lightweight point tokenizer and a learnable task token, EPCL maps 3D neighborhoods into CLIP’s token space and preserves cross-modal semantic alignment without requiring paired 2D-3D data or 3D pretraining. Across detection, semantic segmentation, classification, and few-shot learning, EPCL achieves competitive or superior results to contemporary 3D pre-training methods while dramatically reducing trainable parameters. This approach highlights the practicality of cross-modal knowledge transfer in 3D understanding and offers a scalable path for efficient point cloud representation learning.

Abstract

The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at https://github.com/XiaoshuiHuang/EPCL.
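
The token flow described in the abstract (local patches, tokenizer, task token, frozen encoder) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the dimensions `G`, `K`, `D`, the PointNet-style max pooling inside `tokenize`, and the single matrix standing in for the frozen CLIP transformer are all assumptions chosen only to show the shapes and which parameters would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: G patches, K points per patch, token width D.
G, K, D = 8, 16, 32


def tokenize(patches, w1, w2):
    """Shared two-layer MLP on every point, then max-pool within each patch."""
    centered = patches - patches.mean(axis=1, keepdims=True)  # local coordinates
    hidden = np.maximum(centered @ w1, 0.0)                   # ReLU, (G, K, 64)
    return (hidden @ w2).max(axis=1)                          # (G, D)


def build_sequence(patches, params):
    """Prepend the learnable task token to the patch tokens."""
    tokens = tokenize(patches, params["w1"], params["w2"])
    return np.concatenate([params["task_token"], tokens], axis=0)  # (G + 1, D)


# Parameters that would receive gradient updates (tokenizer and task token).
trainable = {
    "w1": rng.normal(size=(3, 64)),
    "w2": rng.normal(size=(64, D)),
    "task_token": np.zeros((1, D)),
}
# Assumed placeholder for the frozen encoder weights: never updated.
frozen = {"encoder": rng.normal(size=(D, D))}

patches = rng.normal(size=(G, K, 3))       # pre-grouped local patches
sequence = build_sequence(patches, trainable)
encoded = sequence @ frozen["encoder"]     # placeholder for the frozen transformer
print(sequence.shape, encoded.shape)
```

In a real training loop only the entries of `trainable` (plus a task head, omitted here) would be optimized, while `frozen` stays fixed.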

Paper Structure

This paper contains 24 sections, 2 equations, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: (a) The traditional paradigm fine-tunes the whole model, while our method only fine-tunes the tokenizer (T) and head (H). The CLIP transformer, which is initialized from the original CLIP weights, is kept frozen during training. (b) Our EPCL brings accuracy gains with higher training efficiency compared to SOTA pre-training methods.
  • Figure 2: Using the frozen CLIP image transformer as an encoder for 2D and 3D classification, the saliency maps show that the frozen CLIP model can attend to similar regions across modalities.
  • Figure 3: Schematic overview of EPCL. The Point Tokenizer contains two successive steps: Farthest Point Sampling (FPS), which downsamples the input point cloud, and a Multi-Layer Perceptron (MLP), which extracts features from the downsampled point cloud. The Task Token is task-specific and learnable. Tokens from the point tokenizer and the task token are fed into the frozen CLIP Transformer. The Head uses the tokens from the Transformer to yield the predictions for each specific downstream task. The CLIP transformer, which is initialized from the original CLIP weights, is kept frozen during the training stage, while the point cloud tokenizer, task token and head are trainable.
  • Figure 4: The cross-correlation between CLIP image features and point cloud features at layers 1, 9, and 12 for different object categories.
  • Figure 5: The semantic similarity between 2D image and 3D point cloud from significance maps.
  • ...and 8 more figures
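
The Farthest Point Sampling step named in the Figure 3 caption can be sketched in plain Python as follows. This is the standard greedy FPS algorithm, not code taken from the EPCL repository; the function name `farthest_point_sampling` and the `start_index` parameter are illustrative choices.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick indices of points that are far apart from each other."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # Squared distance from each point to its nearest already-selected point.
    min_dist = [math.inf] * len(points)
    for _ in range(num_samples - 1):
        last = points[selected[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # The next sample is the point farthest from the selected set.
        selected.append(max(range(len(points)), key=min_dist.__getitem__))
    return selected


cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0.5, 0, 0), (5, 0, 0)]
centers = farthest_point_sampling(cloud, 3)
print([cloud[i] for i in centers])
```

Each sampled index can then serve as the center of a local patch; this pure-Python version costs O(N * num_samples) distance evaluations, so a vectorized or GPU implementation would be preferable for real point clouds.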