DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, Xinxiao Wu

TL;DR

This work tackles the domain gap between CLIP's 2D image pre-training and 3D point clouds, which hampers zero-shot 3D classification. It introduces DiffCLIP, a dual-branch framework that stylizes depth-map projections of point clouds into CLIP-friendly 2D images using Stable Diffusion with ControlNet, and augments the textual branch with a style-prompt generator for few-shot tasks. Key components include multi-view realistic projection, diffusion-based style transfer, and a meta-net-driven style-prompt mechanism, with two strategies for fusing the final predictions. Evaluated on ModelNet10, ModelNet40, and ScanObjectNN, DiffCLIP achieves state-of-the-art zero-shot accuracy (43.2%) on the OBJ_BG split of ScanObjectNN and competitive zero-shot accuracy (80.6%) on ModelNet10, along with strong few-shot gains, demonstrating effective cross-domain alignment for 3D understanding.

Abstract

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
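
The zero-shot setting described above comes down to comparing image features of the projected views with text features of the class prompts. The sketch below illustrates that comparison with NumPy only. It is a minimal illustration under stated assumptions, not the paper's implementation: the feature arrays are random stand-ins for CLIP encoder outputs, the function names `l2_normalize` and `zero_shot_predict` are hypothetical, and the equal-weight averaging over views is an assumed placeholder rather than either of the paper's two fusion strategies.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def zero_shot_predict(view_features, text_features, temperature=0.01):
    """Classify one object from V view features (V, D) and K class text features (K, D).

    Assumption: views are fused by a plain average of per-view softmax
    probabilities. The paper's own fusion strategies are not reproduced here.
    """
    img = l2_normalize(np.asarray(view_features, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_features, dtype=np.float64))
    logits = img @ txt.T / temperature             # (V, K) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # per-view class probabilities
    fused = probs.mean(axis=0)                     # equal-weight fusion over views
    return int(fused.argmax()), fused


# Synthetic stand-ins for encoder outputs: 10 classes, 4 views, 512-dim features.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(10, 512))
# Views are built as noisy copies of class 3's text feature, purely for illustration.
view_features = text_features[3] + 0.5 * rng.normal(size=(4, 512))

pred, fused = zero_shot_predict(view_features, text_features)
print(pred, fused.round(3))
```

In a real pipeline the two feature arrays would come from the frozen CLIP image and text encoders, with the image features computed on the style-transferred views rather than on raw depth maps.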

Paper Structure (24 sections, 8 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Framework structure of DiffCLIP. The visual branch has two modules: the Multi-View Realistic Projection Module, which produces multi-view depth maps, and the Stable-Diffusion-Based Style Transfer Module, which uses a pre-trained ControlNet and a frozen Stable Diffusion network to transfer styles onto the depth images. The textual branch uses an optional Style-Prompt Generation Module for few-shot tasks and manual prompts for zero-shot tasks. The frozen CLIP image and text encoders produce feature representations of the images and text, which then pass through a Multi-Modal Fusion Block.
  • Figure 2: Style transfer using Stable Diffusion and ControlNet. Results for the 10 categories of the ModelNet10 dataset (right), all generated from a single "monitor" depth map (left). For example, when the style of "bathtub" is applied to the "monitor" depth map, the base of the monitor is filled with blue rippling water. When the style of "chair" or "sofa" is applied, the characteristic textures of these objects appear. When the "monitor" label itself is used, the resulting image clearly shows the toolbar and menu bar on the computer screen.
  • Figure 3: Prompt generation module (left) and Multi-view Fusion Block (right) of DiffCLIP.
  • Figure 4: An example of matrix $P$.
  • Figure 5: An example of a style transfer result. Shown are the logits of the ten images produced by Stable Diffusion style transfer from a single source depth condition ("Monitor"), together with the calculation that follows.
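
Figures 2 and 5 describe stylizing one depth map once per class label and then inspecting the logits of the resulting images. One way to picture this is a square logit matrix whose row i holds the class logits of the image stylized with class i's label. The sketch below is an illustrative reading only: the captions do not define how the logits are combined, so the diagonal-based scoring, the function name `diagonal_scores`, the variable `logit_matrix`, and all numeric values are assumptions, and `logit_matrix` is not claimed to be the matrix P of Figure 4.

```python
import numpy as np


def diagonal_scores(logit_matrix):
    """Turn a (K, K) logit matrix into K class scores.

    Assumed layout: row i = image stylized with class i's label,
    column j = logit against the text prompt of class j.
    Assumed rule: score class i by the diagonal entry (how well the image
    stylized as class i matches class i's text), then apply a softmax.
    """
    m = np.asarray(logit_matrix, dtype=np.float64)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        raise ValueError("logit_matrix must be a square (K, K) array")
    diag = np.diag(m)
    shifted = diag - diag.max()          # stabilize the softmax
    weights = np.exp(shifted)
    return weights / weights.sum()


# Made-up logits for three classes; the source depth map depicts a monitor.
class_names = ["bathtub", "chair", "monitor"]
logit_matrix = np.array([
    [18.0, 20.0, 24.0],   # image stylized as "bathtub"
    [17.0, 21.0, 25.0],   # image stylized as "chair"
    [15.0, 19.0, 29.0],   # image stylized as "monitor"
])

probs = diagonal_scores(logit_matrix)
best = class_names[int(probs.argmax())]
print(best, probs.round(4))
```

The intuition behind this reading is that a depth map stylized with its true label should yield a more convincing image, and therefore a higher matching logit, than the same depth map stylized with a wrong label.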