MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong

TL;DR

MMT-ARD tackles the vulnerability of Vision-Language Models to adversarial perturbations by introducing a multimodal, dual-teacher adversarial distillation framework. A clean and an adversarial teacher are jointly used to guide a student, with a dynamic confidence-based weighting scheme and a cross-modal consistency constraint to stabilize multimodal embeddings. Theoretical robustness analyses establish transfer bounds from the teacher ensemble to the student, and extensive experiments on ImageNet demonstrate notable gains in robust accuracy and zero-shot performance, along with significant training efficiency improvements. The approach shows strong generalization across backbones and modalities, offering a scalable path toward safer, more reliable multimodal understanding in safety-critical applications.

Abstract

Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that jointly optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by 4.32% and zero-shot accuracy by 3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of large multimodal models. Our code is available at https://github.com/itsnotacie/MMT-ARD.
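
To make the dual-teacher idea in the abstract concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation. The function names, the use of squared Euclidean distance between embeddings, the definition of confidence as the maximum softmax probability, and the specific sigmoid of the confidence gap are all assumptions chosen for illustration. The abstract states only that the weights depend on teacher confidence and that a sigmoid-based function is used.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_teacher_loss(student_emb, clean_emb, adv_emb,
                      clean_logits, adv_logits, temperature=1.0):
    """Hypothetical confidence-weighted dual-teacher distillation loss.

    ASSUMPTION: confidence is the max softmax probability of each teacher,
    and the adversarial-teacher weight is a sigmoid of the confidence gap.
    The paper's actual weighting rule may differ.
    """
    conf_clean = max(softmax(clean_logits))
    conf_adv = max(softmax(adv_logits))

    # Sigmoid maps the confidence gap to a weight in (0, 1).
    w_adv = sigmoid((conf_adv - conf_clean) / temperature)
    w_clean = 1.0 - w_adv

    # One consistency term per teacher, combined as a weighted sum.
    loss_clean = squared_distance(student_emb, clean_emb)
    loss_adv = squared_distance(student_emb, adv_emb)
    total = w_clean * loss_clean + w_adv * loss_adv
    return total, w_clean, w_adv


if __name__ == "__main__":
    total, w_clean, w_adv = dual_teacher_loss(
        student_emb=[1.0, 0.0],
        clean_emb=[1.0, 0.0],
        adv_emb=[0.0, 1.0],
        clean_logits=[2.0, 0.0],
        adv_logits=[2.0, 0.0],
    )
    print(total, w_clean, w_adv)
```

In this sketch, equally confident teachers receive equal weight, and the weight shifts toward whichever teacher is more confident on a given sample. The `temperature` parameter (also hypothetical) controls how sharply the weights react to the confidence gap.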

Paper Structure

This paper contains 22 sections, 9 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Multidimensional performance comparison of MMT-ARD with the baseline under different backbones. (a) Teacher-student combination based on ViT-B-32 and RN50. (b) Combination based on ViT-B-32-lora and RN101. The proposed method (Our 1-4) comprehensively outperforms the baseline methods (Baseline 1-4) in both clean accuracy (acc) and robust accuracy (racc).
  • Figure 2: MMT-ARD framework architecture. The same input image is processed separately by two sets of encoders, one from the original teacher and one from the adversarial teacher. The losses L1 and L2 constrain the student model's outputs to be consistent with those of the respective teachers, and their weighted sum achieves collaborative transfer of robust representations.
  • Figure 3: Heatmaps of the models for different teacher-student pairs.

Theorems & Definitions (1)

  • Remark 1