
Learning Robust 3D Representation from CLIP via Dual Denoising

Shuqing Luo, Bowen Qu, Wei Gao

TL;DR

This work targets robust 3D representation learning by leveraging pre-trained vision-language models like CLIP. It introduces Dual Denoising, a cross-modal distillation framework that couples a denoising-based PointDAE proxy task with a feature denoising network, connected via cross-attention, and enhanced by test-time parallel noise inference. The approach yields superior zero-shot 3D recognition performance and improved adversarial robustness under zero-shot settings, without requiring adversarial training. The findings suggest that diffusion-inspired denoising and cross-modal guidance can substantially improve generalization across domains and data distributions, with practical implications for robust 3D understanding in open-world scenarios.

Abstract

In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representations from pre-trained vision-language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resulting 3D learning network remains vulnerable to adversarial attacks, especially iterative attacks. In this work, we propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP. It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, we propose using parallel noise inference to enhance the generalization of point cloud features under cross-domain settings. Experiments show that our model effectively improves both the representation learning performance and the adversarial robustness of the 3D learning network under zero-shot settings, without adversarial training. Our code is available at https://github.com/luoshuqing2001/Dual_Denoising.

Paper Structure (21 sections, 6 equations, 6 figures, 5 tables)

This paper contains 21 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Visualization of the raw, masked, and noised point clouds. The raw point cloud contains 1024 points. For the masked point cloud, we first compute 128 centroids using the farthest point sampling (FPS) algorithm, then select 16 neighbors for each centroid using the $k$-nearest-neighbor (kNN) algorithm. We mask 80% of the clusters and visualize the remaining points. For the noised point cloud, we diffuse the input by $x_t=x_0+\sigma_t\epsilon$, where $\{\sigma_t\}_{t=0}^{T-1}$ is a linear schedule from 0 to $s=0.08$. We choose $t=600$ and $T=1000$.
  • Figure 2: Pipeline of our Dual Denoising algorithm. The upper branch is PointDAE, which serves as a proxy task during pre-training. The lower branch is feature denoising, which gradually transforms Gaussian noise into a CLIP feature under guidance from the upper branch.
  • Figure 3: Implementation of PointDAE during training. The point cloud is tokenized to fit the ViT architecture. We first apply farthest point sampling (FPS) to the clean data to obtain a representative subset of indices. We then extract the clean key-point subset and the noisy one using the same indices. Next, we run $k$-nearest-neighbor (kNN) search to obtain the noisy point tokens and the clean tokens, respectively. A reconstruction loss such as the Chamfer distance is applied between them.
  • Figure 4: The details of a basic block in PointDAE (left) and the feature denoising network (right). Note that the structure of the point cloud denoising decoder is similar to that of the encoder, only without the cross-attention connection to the feature denoising branch. In the cross-attention module, point cloud tokens serve as $K$ and $V$, while feature tokens serve as $Q$. A stop-gradient operation is also used during pre-training to avoid representation collapse.
  • Figure 5: Visualization of adversarial robustness under PGD attack on ModelNet40 and ScanObjectNN. We use the ModelNet40 test set and the OBJ_ONLY test set, respectively.
  • ...and 1 more figure
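
The point cloud operations named in the captions of Figures 1 and 3 (linear noise schedule, FPS, kNN grouping, cluster masking, Chamfer distance) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the schedule is assumed to satisfy $\sigma_0=0$ and $\sigma_{T-1}=s$, $\epsilon$ is assumed to be isotropic Gaussian noise, the "remaining points" are assumed to be the union of the unmasked clusters, and the Chamfer distance is written in one common form (mean squared nearest-neighbor distance in both directions). It uses only the Python standard library so that it runs without extra dependencies.

```python
import random

def linear_sigma(t, T=1000, s=0.08):
    """Linear noise scale, assuming sigma_0 = 0 and sigma_{T-1} = s."""
    return s * t / (T - 1)

def add_noise(points, t, T=1000, s=0.08, rng=random):
    """Diffuse a point cloud: x_t = x_0 + sigma_t * eps, eps ~ N(0, I) (assumed)."""
    sigma = linear_sigma(t, T, s)
    return [tuple(c + sigma * rng.gauss(0.0, 1.0) for c in p) for p in points]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_sampling(points, m, start=0):
    """Indices of m points; each new one is farthest from those already chosen."""
    chosen = [start]
    min_d = [sq_dist(p, points[start]) for p in points]
    for _ in range(m - 1):
        nxt = max(range(len(points)), key=min_d.__getitem__)
        chosen.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen

def knn_indices(points, center, k):
    """Indices of the k points nearest to `center`."""
    return sorted(range(len(points)), key=lambda i: sq_dist(points[i], center))[:k]

def mask_clusters(points, n_centroids=128, k=16, mask_ratio=0.8, rng=random):
    """Build kNN clusters around FPS centroids, mask a fraction of the clusters,
    and return the indices of points in the unmasked clusters (an assumption)."""
    centroids = farthest_point_sampling(points, n_centroids)
    clusters = [knn_indices(points, points[c], k) for c in centroids]
    n_masked = int(round(mask_ratio * n_centroids))
    masked = set(rng.sample(range(n_centroids), n_masked))
    visible = set()
    for ci, cluster in enumerate(clusters):
        if ci not in masked:
            visible.update(cluster)
    return sorted(visible)

def paired_tokens(clean, noisy, m, k):
    """FPS on the clean cloud, then kNN in each cloud around the same indices."""
    idx = farthest_point_sampling(clean, m)
    clean_tok = [[clean[i] for i in knn_indices(clean, clean[c], k)] for c in idx]
    noisy_tok = [[noisy[i] for i in knn_indices(noisy, noisy[c], k)] for c in idx]
    return clean_tok, noisy_tok

def chamfer_distance(a, b):
    """Symmetric Chamfer distance with squared distances (one common form)."""
    d_ab = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    d_ba = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return d_ab + d_ba

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a 1024-point cloud (not real data).
    cloud = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(1024)]
    noised = add_noise(cloud, t=600, rng=rng)    # Figure 1 setting: t=600, T=1000
    visible = mask_clusters(cloud, rng=rng)      # 128 centroids, 16 neighbors, 80% masked
    print(len(noised), len(visible))
```

Under the assumed endpoints, the Figure 1 setting $t=600$, $T=1000$, $s=0.08$ gives $\sigma_{600}=0.08\cdot 600/999\approx 0.048$. Because neighboring kNN clusters can overlap, the number of visible points after masking is at most (number of unmasked clusters) × 16, not exactly 20% of the cloud.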