Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou, Shuchao Pang

TL;DR

This paper tackles the vulnerability of 3D point cloud models to adversarial attacks, a problem compounded by the high computational cost of many existing defenses. It proposes Multimodal Robust Prompt Distillation (MRPD), a teacher-student framework that distills robustness from three modalities (image, via depth projections; text, via learnable prompts; and a strong 3D teacher) into lightweight prompts for the student, using a confidence-gated distillation loss and dynamic modality balancing. Because distillation happens only at training time, the resulting models incur zero inference overhead. They achieve state-of-the-art or competitive robustness on ModelNet40 and ScanObjectNN under white-box and black-box attacks while preserving or improving clean accuracy. The work shows that multimodal knowledge transfer, coupled with prompt tuning, provides a practical and generalizable defense for 3D vision systems, with implications for secure, real-time 3D perception.

Abstract

Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation takes place entirely during training, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

Paper Structure

This paper contains 33 sections, 11 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Traditional Defenses: Heavy modules with increasing inference costs. (b) Our Inference: The optimized prompts provide robustness with zero computational overhead. (c) Our Training: We distill robust, multimodal knowledge into lightweight prompts.
  • Figure 2: The proposed Multimodal Robust Prompt Distillation (MRPD) framework. During training, robust knowledge from three teachers (image, text, and a 3D model) is distilled into lightweight prompts. At inference, these prompts enhance the student model's robustness with zero additional computational cost.
  • Figure 3: MRPD preserves feature space integrity under adversarial attack on ModelNet40. (a) Without defense, features from different classes become indistinguishable. (b) With MRPD, features remain well-separated, ensuring robust classification.
  • Figure 4: Evolution of learned loss weights ($1/\sigma^2$) for each distillation task. The model learns to heavily prioritize the point and image teachers while using the text teacher as a low-weight semantic regularizer.
  • Figure 5: t-SNE visualization of point cloud feature embeddings on the ScanObjectNN test set under the white-box Perturb attack. (a) Features from the unprotected model collapse into an indistinguishable mass. (b) Our method, MRPD, successfully restores clear class separation, leading to robust classification.
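
The learned loss weights ($1/\sigma^2$) mentioned in the Figure 4 caption suggest an uncertainty-style weighting of the three distillation losses. The exact formulation is not given in this summary, so the following is only a minimal sketch of one common form of learned-variance multi-task weighting, in which each task loss $L_i$ is scaled by $1/(2\sigma_i^2)$ and a $\log \sigma_i$ term keeps the weights from collapsing to zero. The function names, the parameterization by $s_i = \log \sigma_i^2$, and the numeric values are all illustrative assumptions, not the authors' implementation.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with learned log-variances (assumed form).

    Each s_i = log(sigma_i^2) would be a trainable scalar in practice.
    total = sum_i( 0.5 * exp(-s_i) * L_i + 0.5 * s_i )
    which equals sum_i( L_i / (2 * sigma_i^2) + log(sigma_i) ).
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need one log-variance per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


def effective_weights(log_vars):
    """Return 1/sigma_i^2 for each task, the quantity plotted in Figure 4."""
    return [math.exp(-s) for s in log_vars]


if __name__ == "__main__":
    # Hypothetical per-teacher distillation losses: point, image, text.
    losses = [0.8, 1.1, 2.5]
    # Hypothetical learned log-variances; a larger s_i means a smaller weight.
    log_vars = [0.0, 0.2, 1.5]
    print("weights 1/sigma^2:", effective_weights(log_vars))
    print("combined loss:", uncertainty_weighted_loss(losses, log_vars))
```

Under this assumed form, a task whose log-variance grows during training receives a smaller weight $1/\sigma^2$, which would be consistent with the caption's description of the text teacher acting as a low-weight semantic regularizer.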