Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen

TL;DR

Chain of Attack (CoA) iteratively enhances the generation of adversarial examples through multi-modal semantic updates over a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency.

Abstract

Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on the multi-modal semantic update using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy can effectively mislead models to generate targeted responses using only black-box attacks, without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model developments.
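
To make the idea of generating an adversarial example through a series of small steps concrete, the following is a minimal sketch of a generic iterative, budget-bounded attack computed on a surrogate model. It is not the CoA algorithm itself: the linear surrogate encoder `W`, the cosine-similarity objective, and the values of `eps`, `alpha`, and `steps` are all assumptions chosen for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def similarity(W, x, t):
    """Cosine similarity between the surrogate embedding W x and unit vector t."""
    z = matvec(W, x)
    return sum(a * b for a, b in zip(z, t)) / norm(z)

def similarity_grad(W, x, t):
    """Analytic gradient of `similarity` with respect to the input x."""
    z = matvec(W, x)
    n = norm(z)
    zt = sum(a * b for a, b in zip(z, t))
    dz = [tj / n - zt * zj / n**3 for zj, tj in zip(z, t)]
    # Chain rule through z = W x: the input gradient is W^T dz.
    return [sum(W[j][i] * dz[j] for j in range(len(W))) for i in range(len(x))]

def iterative_attack(W, x_clean, t, eps=8 / 255, alpha=1 / 255, steps=20):
    """Sign-gradient ascent on the similarity, projected after every step
    onto the L-infinity ball of radius eps and the valid range [0, 1]."""
    x_adv = list(x_clean)
    for _ in range(steps):
        g = similarity_grad(W, x_adv, t)
        for i, gi in enumerate(g):
            step = alpha * ((gi > 0) - (gi < 0))  # alpha * sign(gradient)
            lo = max(0.0, x_clean[i] - eps)
            hi = min(1.0, x_clean[i] + eps)
            x_adv[i] = min(max(x_adv[i] + step, lo), hi)
    return x_adv

# Toy setup (assumed): a random linear "encoder" and random "images".
rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(16)]
x_clean = [rng.random() for _ in range(64)]
x_target = [rng.random() for _ in range(64)]
z_target = matvec(W, x_target)
t = [a / norm(z_target) for a in z_target]  # unit target embedding

x_adv = iterative_attack(W, x_clean, t)
sim_clean = similarity(W, x_clean, t)
sim_adv = similarity(W, x_adv, t)
linf = max(abs(a - b) for a, b in zip(x_adv, x_clean))
```

In a transfer-based setting, the perturbation would be crafted on a surrogate like this and then passed to a separate victim model that the attacker cannot inspect; the sketch covers only the crafting loop.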

Paper Structure

This paper contains 13 sections, 9 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of the proposed CoA with other attacking strategies. The CLIP score results of Unidiffuser are reported. Our method shows both superior performance and efficiency.
  • Figure 2: The pipeline of the Chain of Attack (CoA) framework. (a) Our framework proposes using modality-aware embeddings to capture the semantic correspondence between images and texts. To enhance the adversarial transferability, we use a chain of attacks that explicitly updates the adversarial examples based on their previous multi-modal semantics in a step-by-step manner. A Targeted Contrastive Matching objective is further proposed to align and differentiate the semantics among clean, adversarial, and target reference examples. (b) Targeted response generation is conducted during inference, where the victim models give responses based on the adversarial examples. We further introduce a unified ASR computing strategy for automatic and comprehensive robustness evaluation of VLMs in response generation.
  • Figure 3: Illustration of the attacking chain. Given the modality-aware embeddings of clean examples and target examples, the adversarial examples including the image perturbations and the corresponding textual information are explicitly updated in a step-by-step manner with the guidance of Targeted Contrastive Matching. This Chain of Attack enhances the adversarial example generation while providing a clear and human-understandable "evolution" process, e.g., from "A bird in the park" to "Two young boys playing baseball on a field".
  • Figure 4: Examples of the proposed LLM-based attack success rate evaluation. From left to right, the examples depict a completely successful attack case, a fooled-only case, and a failed attack case, respectively. The output score for each case is at the bottom.
  • Figure 5: (a) Visual interpretation of the adversarial examples. $AM$ represents the attention map, which is based on the image-text similarity. The clean text and the target image are generated based on the clean image and the selected target reference text, respectively. (b) The effect of $\epsilon$ on Unidiffuser. The generated captions in red (i.e., $\epsilon \geq 8$) are close to the target text.
  • ...and 6 more figures
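
The Figure 2 caption describes modality-aware embeddings and an objective that aligns and differentiates the semantics of clean, adversarial, and target examples. One simple way to read that description is sketched below, under loud assumptions: the fusion rule in `fuse` (a convex mix weighted by `lam`) and the score in `matching_score` (similarity to the target minus similarity to the clean example) are hypothetical stand-ins, not the formulas defined in the paper.

```python
import math

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def cosine(u, v):
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def fuse(image_emb, text_emb, lam=0.5):
    """Hypothetical fused embedding: convex mix of the unit image and
    text embeddings, renormalized. `lam` is an assumed weighting."""
    img, txt = normalize(image_emb), normalize(text_emb)
    return normalize([lam * a + (1 - lam) * b for a, b in zip(img, txt)])

def matching_score(adv, target, clean):
    """Hypothetical contrastive score to maximize: pull the adversarial
    embedding toward the target and push it away from the clean one."""
    return cosine(adv, target) - cosine(adv, clean)

# Toy 3-d embeddings (made up for illustration only).
clean = fuse([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
target = fuse([0.0, 1.0, 0.0], [0.1, 0.9, 0.0])

score_at_clean = matching_score(clean, target, clean)    # adversarial == clean
score_at_target = matching_score(target, target, clean)  # adversarial == target
```

Under this reading, the score is negative when the adversarial embedding still coincides with the clean one and positive once it coincides with the target, so an attacker would adjust the perturbation to increase it while staying within the $\epsilon$ budget.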