Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models Via Diffusion Models

Qi Guo, Shanmin Pang, Xiaojun Jia, Yang Liu, Qing Guo

TL;DR

This work addresses the efficiency and realism shortcomings of targeted transfer-based attacks on Vision-Language Models by introducing AdvDiffVLM, a diffusion-model–driven framework that generates natural, unrestricted, targeted adversarial examples. It integrates Adaptive Ensemble Gradient Estimation to robustly estimate gradients from multiple surrogates and GradCAM-guided Mask Generation to distribute adversarial semantics across the image, improving both transferability and visual quality. Theoretical grounding ties score matching to embedding target semantics in the diffusion reverse process, enabling efficient, iterative refinement of adversarial content. Empirically, AdvDiffVLM achieves 5x–10x faster adversarial example generation and superior transferability across open-source and commercial VLMs, highlighting both vulnerability and the need for stronger robustness measures.

Abstract

Adversarial attacks, particularly targeted transfer-based attacks, can be used to assess the adversarial robustness of large vision-language models (VLMs), allowing for a more thorough examination of potential security flaws before deployment. However, previous transfer-based adversarial attacks incur high costs due to high iteration counts and complex method structures. Furthermore, because their adversarial semantics are unnatural, the generated adversarial examples have low transferability. These issues limit the utility of existing methods for assessing robustness. To address them, we propose AdvDiffVLM, which uses diffusion models to generate natural, unrestricted and targeted adversarial examples via score matching. Specifically, AdvDiffVLM uses Adaptive Ensemble Gradient Estimation to modify the score during the diffusion model's reverse generation process, ensuring that the produced adversarial examples have natural adversarial targeted semantics, which improves their transferability. Simultaneously, to improve the quality of adversarial examples, we use the GradCAM-guided Mask method to disperse adversarial semantics throughout the image rather than concentrating them in a single area. Finally, AdvDiffVLM embeds more target semantics into adversarial examples over multiple iterations. Experimental results show that our method generates adversarial examples 5x to 10x faster than state-of-the-art transfer-based adversarial attacks while producing higher-quality adversarial examples. Furthermore, compared to previous transfer-based adversarial attacks, the adversarial examples generated by our method have better transferability. Notably, AdvDiffVLM can successfully attack a variety of commercial VLMs in a black-box environment, including GPT-4V.

Paper Structure (21 sections, 9 equations, 14 figures, 8 tables, 1 algorithm)

Figures (14)

  • Figure 1: Comparison of different transfer-based attacks and our method on VLMs. (a) Comparison of attack performance. We select BLIP2 and Img2LLM as representative VLMs. As comparison methods, we select existing transfer-based attacks combined with AttackVLM, including Ens, SVRE, CWA, SSA and SIA. We report the CLIP$_{tar}$ score, which is the similarity between the responses generated for the input images and the target text. (b) Comparison of image quality. We enlarge a local area of the adversarial examples to make the differences visible. Adversarial examples generated by transfer-based attacks exhibit notable noise, whereas our method yields better visual quality. Zoom in for a clearer comparison.
  • Figure 2: The CLIP$_{img}$ score varies with the step sizes. Here, CLIP$_{img}$ is the similarity between the adversarial examples and the adversarial target images, computed with the visual encoder of CLIP ViT-B/32. We choose SSA as the representative transfer-based attack.
  • Figure 3: The main framework of AdvDiffVLM for efficiently generating transferable unrestricted adversarial examples. AdvDiffVLM mainly includes two components: Adaptive Ensemble Gradient Estimation (AEGE) and GradCAM-guided Mask Generation (GCMG). Each is described in its own section; see the method section for the meaning of the symbols.
  • Figure 4: The pipeline of the AEGE.
  • Figure 5: Differences in theoretical foundations and implementation between AdvDiffuser and our method. Here, "Sampling" refers to $\tilde{x}_{t-1} = \left( \tilde{x}_t - (1-\alpha_t) \cdot \boldsymbol{\varepsilon}_\theta(\tilde{x}_t, t) / \sqrt{1-\bar{\alpha}_t} \right) / \sqrt{\alpha_t}$, and "Score Match" refers to $\tilde{x}_{t-1} = \tilde{x}_t + (1-\alpha_t) \cdot \text{score}$.
  • ...and 9 more figures
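
The two update rules quoted in the Figure 5 caption can be compared numerically. The sketch below is illustrative only and is not the authors' implementation: the function names, the array shapes, and the values of `alpha_t` and `alpha_bar_t` are hypothetical. It also assumes the standard diffusion identity score ≈ -ε_θ / sqrt(1 - ᾱ_t), which the caption itself does not state. The "Score Match" rule is transcribed exactly as it appears in the caption.

```python
import numpy as np

def sampling_step(x_t, eps, alpha_t, alpha_bar_t):
    # "Sampling" rule from the Figure 5 caption (noise term omitted).
    return (x_t - (1.0 - alpha_t) * eps / np.sqrt(1.0 - alpha_bar_t)) / np.sqrt(alpha_t)

def score_match_step(x_t, score, alpha_t):
    # "Score Match" rule, transcribed as written in the caption.
    return x_t + (1.0 - alpha_t) * score

def score_from_eps(eps, alpha_bar_t):
    # ASSUMPTION: standard identity linking predicted noise to the score.
    return -eps / np.sqrt(1.0 - alpha_bar_t)

# Hypothetical toy inputs standing in for a noisy image and predicted noise.
rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
eps = rng.standard_normal((3, 8, 8))
alpha_t, alpha_bar_t = 0.98, 0.5  # illustrative schedule values

score = score_from_eps(eps, alpha_bar_t)
x_sampling = sampling_step(x_t, eps, alpha_t, alpha_bar_t)
x_score = score_match_step(x_t, score, alpha_t)

# Under the assumed identity, the two rules differ only by a 1/sqrt(alpha_t) factor.
same_up_to_scale = np.allclose(x_sampling, x_score / np.sqrt(alpha_t))
```

Under that assumption, the score form exposes a single term that an attack can modify, for example by adding an estimated gradient toward the target semantics, which is the role the abstract assigns to Adaptive Ensemble Gradient Estimation.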