MF-CLIP: Leveraging CLIP as Surrogate Models for No-box Adversarial Attacks

Jiaming Zhang, Lingyu Qiu, Qi Yi, Yige Li, Jitao Sang, Changsheng Xu, Dit-Yan Yeung

TL;DR

This work tackles no-box adversarial attacks by leveraging Vision-Language Models as surrogates, identifying that vanilla CLIP has strong representations but limited discriminative margins for domain-specific attacks. It introduces MF-CLIP, a two-stage framework consisting of margin-based fine-tuning to widen inter-class margins and a generator-based adversarial-perturbation module to produce transferable examples, yielding substantial performance gains over state-of-the-art baselines. The authors support their approach with theoretical margin analysis and extensive experiments across seven datasets, multiple target architectures, and large-scale ImageNet/ViT scenarios, reporting average improvements of $15.23\%$ on standard models and $9.52\%$ on adversarially trained models. The results underscore the critical role of surrogate-model discriminative power in no-box transferability and suggest that margin-focused fine-tuning of foundation models can significantly enhance adversarial effectiveness in realistic attack settings, with implications for defense and transfer learning in multimodal models.

Abstract

The vulnerability of Deep Neural Networks (DNNs) to adversarial attacks poses a significant challenge to their deployment in safety-critical applications. While extensive research has addressed various attack scenarios, the no-box attack setting, in which adversaries have no prior knowledge of the target model and no access to its training data, remains relatively underexplored despite its practical relevance. This work presents a systematic investigation into leveraging large-scale Vision-Language Models (VLMs), particularly CLIP, as surrogate models for executing no-box attacks. Our theoretical and empirical analyses reveal a key limitation: when applied directly as a surrogate model, vanilla CLIP lacks sufficient discriminative capability for effective no-box attacks. To address this limitation, we propose MF-CLIP, a novel framework that enhances CLIP's effectiveness as a surrogate model through margin-aware feature space optimization. Comprehensive evaluations across diverse architectures and datasets demonstrate that MF-CLIP substantially advances the state of the art in no-box attacks, surpassing existing baselines by 15.23% on standard models and by 9.52% on adversarially trained models. Our code will be made publicly available to facilitate reproducibility and future research in this direction.
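To make the notion of an inter-class margin concrete, the following minimal sketch measures the margin of one image feature against a set of class text features using cosine similarity, and applies a generic hinge penalty when the margin is too small. This is an illustration only, not the authors' objective: the function names, the hinge form, and the `target_margin` value are assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def inter_class_margin(image_feat, text_feats, label):
    """Similarity to the true class minus the best similarity to any other class."""
    sims = [cosine(image_feat, t) for t in text_feats]
    best_other = max(s for i, s in enumerate(sims) if i != label)
    return sims[label] - best_other


def margin_hinge_loss(image_feat, text_feats, label, target_margin=0.2):
    """Generic hinge penalty (hypothetical): zero once the margin reaches target_margin."""
    return max(0.0, target_margin - inter_class_margin(image_feat, text_feats, label))


# Well-separated class features: large margin, no penalty.
print(margin_hinge_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], label=0))
# Nearly identical class features: small margin, positive penalty.
print(margin_hinge_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.1]], label=0))
```

A surrogate whose class features sit close together, as in the second call, yields a small margin even when the prediction is correct; widening that margin is the kind of effect margin-aware fine-tuning aims for.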

Paper Structure (30 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Comparison of MF-CLIP's attack success rate (ASR) against state-of-the-art methods across seven datasets. Results are averaged over three target models (EfficientNet-B0, RegNetX-1.6GF, and ResNet-18). MF-CLIP shows consistent and substantial performance advantages across all datasets.
  • Figure 2: The t-SNE visualization results for the (a) vanilla CLIP and (b) MF-CLIP on the 37-class OxfordPet dataset.
  • Figure 3: The proposed MF-CLIP framework. Our approach consists of two main stages: (1) Margin-aware fine-tuning, which enhances CLIP's discriminative capabilities by optimizing inter-class margins while preserving its representational power; and (2) Adversarial example generation, where the fine-tuned model serves as a surrogate to generate highly transferable adversarial examples. In the diagram, black solid lines represent the forward process, while red dashed lines indicate the backward process (parameter updates).
  • Figure 4: Comparison of different surrogate models with a ResNet-50 backbone. Vanilla CLIP performs even worse than ImageNet models, while MF-CLIP significantly outperforms both.
  • Figure 5: Computational efficiency comparison on the Flowers102 dataset: time consumption vs. memory usage at different batch sizes. The × symbol indicates out-of-memory failures.
  • ...and 1 more figure