Robust image classification with multi-modal large language models
Francesco Villani, Igor Maljkovic, Dario Lazzaro, Angelo Sotgiu, Antonio Emanuele Cinà, Fabio Roli
TL;DR
The paper tackles adversarial robustness in image classification by introducing MultiShield, a defense that pairs an adversarially trained image classifier with a multi-modal vision-language detector (CLIP) to identify semantic misalignment between visual inputs and textual prompts. Its rejection mechanism compares the unimodal and multi-modal predictions and abstains when they disagree, treating the disagreement as a sign of possible adversarial manipulation. Empirical results on CIFAR-10 and ImageNet, across multiple robust and non-robust models, show that MultiShield substantially boosts robust accuracy, with average gains of about 32% on CIFAR-10 and 65% on ImageNet, at the cost of small reductions in clean accuracy. The work shows that semantic cross-modal signals offer a practical, scalable complement to existing defenses, improving resilience against adaptive attackers without extensive retraining.
Abstract
Deep Neural Networks are vulnerable to adversarial examples, i.e., carefully crafted input samples that can cause models to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. However, most of these approaches focus on a single data modality, overlooking the relationships between visual patterns and textual descriptions of the input. In this paper, we propose a novel defense, MultiShield, designed to combine and complement these defenses with multi-modal information to further enhance their robustness. MultiShield leverages multi-modal large language models to detect adversarial examples and abstain from uncertain classifications when there is no alignment between textual and visual representations of the input. Extensive evaluations on the CIFAR-10 and ImageNet datasets, using robust and non-robust image classification models, demonstrate that MultiShield can be easily integrated to detect and reject adversarial examples, outperforming the original defenses.
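
To illustrate the abstention idea, the following is a minimal sketch of an agreement-based rejection rule. It is not the authors' implementation. It assumes that the image classifier's logits, an image embedding, and one text-prompt embedding per class have already been computed by some vision-language model. The names `classify_or_reject`, `zero_shot_predict`, and `REJECT` are hypothetical, and cosine similarity is an assumed choice for scoring image-text alignment.

```python
import math
from typing import Optional, Sequence

# Hypothetical sentinel meaning "abstain from classifying this input".
REJECT = None


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def zero_shot_predict(image_emb: Sequence[float],
                      prompt_embs: Sequence[Sequence[float]]) -> int:
    """Pick the class whose text-prompt embedding best aligns with the image."""
    return argmax([cosine(image_emb, p) for p in prompt_embs])


def classify_or_reject(classifier_logits: Sequence[float],
                       image_emb: Sequence[float],
                       prompt_embs: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the class label if both predictors agree, otherwise abstain."""
    unimodal = argmax(classifier_logits)                    # image classifier
    multimodal = zero_shot_predict(image_emb, prompt_embs)  # image-text alignment
    return unimodal if unimodal == multimodal else REJECT


# Toy example with two classes and 2-D embeddings (made-up numbers).
prompts = [[1.0, 0.0], [0.0, 1.0]]
image = [0.9, 0.1]  # aligns best with the prompt for class 0
agree = classify_or_reject([2.0, 0.5], image, prompts)     # both predict class 0
disagree = classify_or_reject([0.1, 3.0], image, prompts)  # classifier predicts 1
```

In this toy setting, the first call returns a label because both predictors choose class 0, while the second abstains because the classifier and the image-text alignment disagree. A practical system would obtain the embeddings from a real vision-language model, and its exact decision rule may differ from this sketch.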
