Learning Robust 3D Representation from CLIP via Dual Denoising
Shuqing Luo, Bowen Qu, Wei Gao
TL;DR
This work targets robust 3D representation learning by leveraging pre-trained vision-language models like CLIP. It introduces Dual Denoising, a cross-modal distillation framework that couples a denoising-based PointDAE proxy task with a feature denoising network, connected via cross-attention, and enhanced by test-time parallel noise inference. The approach yields superior zero-shot 3D recognition performance and improved adversarial robustness under zero-shot settings, without requiring adversarial training. The findings suggest that diffusion-inspired denoising and cross-modal guidance can substantially improve generalization across domains and data distributions, with practical implications for robust 3D understanding in open-world scenarios.
Abstract
In this paper, we explore a critical yet under-investigated issue: how to learn robust and well-generalized 3D representations from pre-trained vision-language models such as CLIP. Previous works have demonstrated that cross-modal distillation can provide rich and useful knowledge for 3D data. However, like most deep learning models, the resulting 3D learning network is still vulnerable to adversarial attacks, especially iterative attacks. In this work, we propose Dual Denoising, a novel framework for learning robust and well-generalized 3D representations from CLIP. It combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, we propose utilizing parallel noise inference to enhance the generalization of point cloud features under cross-domain settings. Experiments show that our model effectively improves both the representation learning performance and the adversarial robustness of the 3D learning network under zero-shot settings, without adversarial training. Our code is available at https://github.com/luoshuqing2001/Dual_Denoising.
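
The abstract names parallel noise inference and zero-shot recognition but does not spell out how either is computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that several independently noised copies of a point cloud are encoded in one batch, that their features are averaged, and that the averaged feature is matched to class text embeddings by cosine similarity. The names `toy_encoder`, `parallel_noise_features`, and `zero_shot_predict`, the noise scale `sigma`, and the mean aggregation are all hypothetical; the random matrices merely stand in for a trained 3D encoder and CLIP text embeddings.

```python
import numpy as np

def toy_encoder(points, weights):
    # Hypothetical stand-in for a 3D encoder (assumption, not the paper's network):
    # a per-point linear map with tanh, followed by max pooling over the points.
    return np.tanh(points @ weights).max(axis=-2)

def parallel_noise_features(points, weights, num_copies=8, sigma=0.01, seed=0):
    # Assumed reading of "parallel noise inference": perturb the input with
    # several independent Gaussian noise samples and encode them as one batch.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(num_copies,) + points.shape)
    noisy_batch = points[None, :, :] + noise        # shape (K, N, 3)
    features = toy_encoder(noisy_batch, weights)    # shape (K, D)
    # Mean aggregation is an assumption; the abstract does not specify it.
    return features.mean(axis=0)                    # shape (D,)

def zero_shot_predict(feature, text_embeddings):
    # Zero-shot classification by cosine similarity to class text embeddings.
    f = feature / np.linalg.norm(feature)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return int(np.argmax(t @ f))

# Synthetic data standing in for a point cloud and CLIP text embeddings.
rng = np.random.default_rng(42)
points = rng.uniform(-1.0, 1.0, size=(1024, 3))
weights = rng.normal(size=(3, 16))
text_embeddings = rng.normal(size=(5, 16))

feature = parallel_noise_features(points, weights)
label = zero_shot_predict(feature, text_embeddings)
```

Because the noisy copies are stacked along a leading batch axis, the encoder processes them in a single forward pass instead of a sequential loop, which is the sense in which this sketch treats the inference as parallel.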
