MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong
TL;DR
MMT-ARD addresses the vulnerability of Vision-Language Models to adversarial perturbations with a multimodal, dual-teacher adversarial distillation framework. A clean teacher and an adversarial teacher jointly guide a student, using a dynamic confidence-based weighting scheme and a cross-modal consistency constraint to stabilize multimodal embeddings. Theoretical robustness analyses establish transfer bounds from the teacher ensemble to the student. Experiments on ImageNet and zero-shot benchmarks report gains of +4.32% in robust accuracy and +3.5% in zero-shot accuracy on ViT-B-32, along with a 2.3x increase in training efficiency over single-teacher methods. The approach generalizes across backbones and modalities, offering a scalable path toward more reliable multimodal understanding in safety-critical applications.
Abstract
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD, a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that jointly optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by 4.32% and zero-shot accuracy by 3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of large multimodal models. Our code is available at https://github.com/itsnotacie/MMT-ARD.
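To make the idea of confidence-based dual-teacher weighting concrete, the following is a minimal NumPy sketch. It is not the formulation used in MMT-ARD, which the abstract does not spell out. It assumes each teacher supplies a per-sample confidence score, and it uses a sigmoid gate on the confidence gap to blend two squared-error embedding distillation terms. The function names (`teacher_weights`, `dual_teacher_loss`), the `temperature` parameter, and the squared-error distance are all hypothetical choices made for illustration; the focus on harder samples and the cross-modal terms are not modeled here.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def teacher_weights(conf_clean, conf_robust, temperature=1.0):
    """Per-sample weights for the clean and robust teachers (sum to 1).

    Assumption: the gate is a sigmoid of the confidence gap; the paper's
    actual weighting function may differ.
    """
    gap = np.asarray(conf_robust, dtype=float) - np.asarray(conf_clean, dtype=float)
    w_robust = sigmoid(gap / temperature)
    return 1.0 - w_robust, w_robust


def dual_teacher_loss(student, teacher_clean, teacher_robust,
                      conf_clean, conf_robust, temperature=1.0):
    """Confidence-weighted sum of two embedding-matching terms.

    All embedding arrays have shape (batch, dim). Squared Euclidean
    distance is an illustrative stand-in for the distillation distance.
    """
    w_clean, w_robust = teacher_weights(conf_clean, conf_robust, temperature)
    d_clean = np.sum((student - teacher_clean) ** 2, axis=1)
    d_robust = np.sum((student - teacher_robust) ** 2, axis=1)
    return float(np.mean(w_clean * d_clean + w_robust * d_robust))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = rng.normal(size=(4, 8))
    teacher_clean = rng.normal(size=(4, 8))
    teacher_robust = rng.normal(size=(4, 8))
    # Hypothetical per-sample teacher confidences in [0, 1].
    conf_clean = np.array([0.9, 0.2, 0.5, 0.7])
    conf_robust = np.array([0.3, 0.8, 0.5, 0.6])
    print(dual_teacher_loss(student, teacher_clean, teacher_robust,
                            conf_clean, conf_robust))
```

In this sketch, a sample on which the robust teacher is more confident than the clean teacher pulls the student more strongly toward the robust teacher's embedding, and equal confidences give each teacher a weight of 0.5. A smaller `temperature` makes the gate sharper.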
