CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee

TL;DR

This work introduces CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. It applies parameter-efficient fine-tuning to CLIP's encoders and designs an entropy-guided view sampling strategy for selecting confident projections.

Abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on the PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Code is available at https://github.com/SarthakM320/CLIPoint3D.
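The entropy-guided view sampling mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each of the M projected depth maps yields a vector of class logits, that confidence is measured by the Shannon entropy of the softmax distribution, and that the lowest-entropy views are kept and averaged. The function names, the `keep` parameter, and the averaging step are illustrative assumptions and are not taken from the paper or its repository.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one view's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_views(
    view_logits: Sequence[Sequence[float]], keep: int
) -> Tuple[List[int], List[float]]:
    """Keep the `keep` lowest-entropy views and average their predictions.

    `view_logits` holds one row of class logits per projected view.
    Returns the kept view indices (ascending) and the fused class
    probabilities. Averaging is an assumed fusion rule, not the paper's.
    """
    if not 1 <= keep <= len(view_logits):
        raise ValueError("keep must be between 1 and the number of views")
    probs = [softmax(row) for row in view_logits]
    # Lower entropy means a more peaked, i.e. more confident, prediction.
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]))
    kept = sorted(ranked[:keep])
    n_classes = len(probs[0])
    fused = [sum(probs[i][c] for i in kept) / len(kept) for c in range(n_classes)]
    return kept, fused


if __name__ == "__main__":
    # Four hypothetical views over three classes; views 1 and 3 are ambiguous.
    logits = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
    kept, fused = select_confident_views(logits, keep=2)
    print(kept, fused)
```

In this toy input the two near-uniform views are discarded, so only the peaked predictions contribute to the fused distribution. A threshold on entropy, rather than a fixed `keep` count, would be an equally plausible variant.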
Paper Structure

This paper contains 23 sections, 18 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison of CLIPoint3D with SOTA methods on GraspNetPC-10. Encoder-based 3D UDA methods (e.g., PointDAN, GAST, MLSP) are accurate but computationally expensive, while CLIP-based extensions fail to bridge the synthetic-real gap. CLIPoint3D achieves a +16.4% improvement with minimal overhead.
  • Figure 2: Overview of CLIPoint3D, the first CLIP-based unsupervised 3D point cloud domain adaptation framework, which comprises four key modules: (1) knowledge-driven prompt tuning generates LLM-guided textual and 3D-aware visual prompts; (2) parameter-efficient fine-tuning (PEFT) jointly optimizes these prompts and the encoder, while (3) entropy-based view selection filters unreliable projections; (4) dual objectives, the uncertainty-aware prototype loss $\mathbf{L}_{\mathrm{proto}}$ and the optimal transport loss $\mathbf{L}_{\mathrm{OT}}$, achieve joint semantic and statistical alignment. Additional regularizers, $\mathbf{L}_{\mathrm{conf}} = \mathbf{L}_{\mathrm{conf(S)}} + \mathbf{L}_{\mathrm{conf(T)}}$ and $\mathbf{L}_{\mathrm{ortho}} = \mathbf{L}_{\mathrm{ortho(S)}} + \mathbf{L}_{\mathrm{ortho(T)}}$, ensure stable learning across source and target domains.
  • Figure 3: (a) Effect of the number of labeled samples in $\mathcal{D}_s$ during training. (b) Effect of projected views. Accuracy variation with projection count $M$.
  • Figure 4: t-SNE visualization of CLIPoint3D's performance. Alignment between synthetic and real domains post-adaptation. FD and MMD quantify domain gap reduction.
  • Figure 5: LLM attribute generation. To derive high-level 3D knowledge representations, we follow a three-stage pipeline. First (top box), we provide an instructional query prompt to an LLM (e.g., GPT-5). In response, the LLM produces detailed, geometry-aware visual descriptions (middle box). Finally (bottom box), we generate highly contextualized textual prompts (one caption per class) by combining a modality-specific prefix template with the LLM-generated attributes.
  • ...and 4 more figures
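
The optimal transport loss $\mathbf{L}_{\mathrm{OT}}$ named in the Figure 2 caption is not defined in this excerpt. As a rough illustration of what an OT-based alignment term can look like, the sketch below computes a generic entropy-regularized transport cost between source and target feature sets using Sinkhorn iterations. It assumes uniform marginals, a squared Euclidean cost, and hypothetical function names and default values for `eps` and `iters`; the paper's actual formulation may differ.

```python
import math
from typing import List, Sequence


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sinkhorn_plan(
    cost: Sequence[Sequence[float]], eps: float = 0.1, iters: int = 200
) -> List[List[float]]:
    """Entropy-regularized transport plan with uniform marginals.

    Plain-domain Sinkhorn: large cost/eps ratios can underflow, so a
    log-domain variant would be preferable for real features.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # assumed uniform mass on source samples
    b = [1.0 / m] * m  # assumed uniform mass on target samples
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment_loss(
    source: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    eps: float = 0.1,
    iters: int = 200,
) -> float:
    """Transport cost <plan, cost> between source and target features."""
    cost = [[sq_dist(s, t) for t in target] for s in source]
    plan = sinkhorn_plan(cost, eps, iters)
    return sum(
        plan[i][j] * cost[i][j]
        for i in range(len(source))
        for j in range(len(target))
    )


if __name__ == "__main__":
    src = [[0.0, 0.0], [1.0, 0.0]]
    aligned = ot_alignment_loss(src, src)                     # identical sets
    shifted = ot_alignment_loss(src, [[0.0, 1.0], [1.0, 1.0]])  # shifted by 1
    print(aligned, shifted)
```

The loss is close to zero when the two feature sets coincide and grows as the target set shifts away, which is the behavior an alignment objective needs. In a training setup this quantity would be computed on differentiable tensors rather than Python lists.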