Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng

TL;DR

The paper addresses the vulnerability of video-based multimodal LLMs (V-MLLMs) to adversarial inputs and the practical concern of transferring such attacks across models in black-box settings. It proposes the I2V-MLLM attack, which uses an image-based multimodal surrogate (I-MLLM) to craft perturbations that disrupt both vision features and multimodal alignment, with a perturbation propagation mechanism to cope with unknown frame sampling. Through experiments on MSVD-QA, MSRVTT-QA, and ActivityNet-200, the method achieves strong cross-model transferability, with average attack success rates approaching and sometimes rivaling white-box baselines, and shows notable degradation in accuracy and GPT-assessed metrics across multiple target V-MLLMs. The findings underscore the need for robust defenses in V-MLLM deployments and illustrate how cross-modal surrogates can enhance adversarial transferability in video-language tasks.

Abstract

Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples. Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. Additionally, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as a surrogate model) achieve competitive performance, with average attack success rates (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA, respectively, on Zero-Shot VideoQA tasks.

Paper Structure

This paper contains 26 sections, 9 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: An example of transferable adversarial attack on different target V-MLLMs for Zero-Shot VideoQA task.
  • Figure 2: Overview of our proposed method. (a) I2V-MLLM Attack: The clean video is divided into $K$ clips. Key frames are extracted from these clips to form the clean frames $X$, which are then fed into the vision model to extract clean frame-level embeddings $F_V(X)$. These embeddings are subsequently aggregated via spatiotemporal pooling to obtain clean spatiotemporal embeddings $F_V^{st}(X)$. Perturbations are initialized and added to the clean frames $X$ to generate adversarial frames $X_{adv}$. The same process is applied to extract $F_V(X_{adv})$ and $F_V^{st}(X_{adv})$. An LLM reformulates the QA pairs into a caption set $T$. $F_V(X)$, $F_V(X_{adv})$, and $T$ are then passed through the projector to extract visual features $F_P^v(X)$, adversarial visual features $F_P^v(X_{adv})$, and textual features $F_P^t(T)$, respectively. Perturbations are updated via the PGD method by minimizing three cosine similarity-based losses: $L_V$, $L_P^v$, and $L_P^{v2t}$. (b) Perturbation Propagation: The final perturbations applied to key-frames are propagated back to their corresponding video clips to construct the adversarial video. (c) Attack Different Target V-MLLMs.
  • Figure 3: Attack success rates (ASR, %) of the I2V-MLLM attack with different loss functions.
  • Figure 4: AASR (%) of the I2V-MLLM attack with various key-frame ratios, comparing scenarios with and without perturbation propagation. 'Prop.' represents 'Propagation'.
  • Figure 5: An example of using GPT-4o-mini to evaluate Accuracy and GPT Score for the VideoQA task, following the methodology of Video-ChatGPT (Maaz et al., 2024).
  • ...and 9 more figures
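
The Figure 2 caption describes two mechanics that a small example may make more concrete: a PGD update that lowers the cosine similarity between clean and adversarial key-frame embeddings, and the propagation of each key-frame perturbation to the rest of its clip. The following is a minimal sketch under strong assumptions, not the authors' implementation: the vision model is replaced by a hypothetical linear map `W`, only a frame-level loss in the spirit of $L_V$ is used (spatiotemporal pooling, $L_P^v$, and $L_P^{v2t}$ are omitted), the first frame of each clip is taken as its key frame, and `eps`, `alpha`, `steps`, and `seed` are illustrative values.

```python
import math
import random

def encode(W, x):
    """Toy stand-in for the vision model (assumption): F_V(x) = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_grad(W, x, x_adv):
    """Analytic gradient of cos(F_V(x), F_V(x_adv)) with respect to x_adv."""
    u, v = encode(W, x), encode(W, x_adv)
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    c = cosine(u, v)
    # d cos / d v = u / (|u||v|) - cos * v / |v|^2
    g_v = [a / (nu * nv) - c * b / (nv * nv) for a, b in zip(u, v)]
    # Chain rule through the linear encoder: W^T g_v
    return [sum(W[i][j] * g_v[i] for i in range(len(W))) for j in range(len(x))]

def sign(t):
    return (t > 0) - (t < 0)

def pgd_keyframe(W, x, eps=0.1, alpha=0.02, steps=20, seed=0):
    """Signed-gradient PGD that minimizes the clean/adversarial cosine similarity."""
    rng = random.Random(seed)
    # Random start: the gradient vanishes at delta = 0, where the cosine equals 1.
    delta = [rng.uniform(-eps, eps) for _ in x]
    u = encode(W, x)
    best = list(delta)
    best_cos = cosine(u, encode(W, [a + d for a, d in zip(x, delta)]))
    for _ in range(steps):
        x_adv = [a + d for a, d in zip(x, delta)]
        g = cosine_grad(W, x, x_adv)
        # Descend on the similarity, then project back into the L-infinity ball.
        delta = [max(-eps, min(eps, d - alpha * sign(gj))) for d, gj in zip(delta, g)]
        c = cosine(u, encode(W, [a + d for a, d in zip(x, delta)]))
        if c < best_cos:  # keep the strongest perturbation seen so far
            best, best_cos = list(delta), c
    return best

def propagate(clips, deltas):
    """Add each key-frame perturbation to every frame of its clip, clamped to [0, 1]."""
    return [[[min(1.0, max(0.0, p + d)) for p, d in zip(frame, delta)]
             for frame in clip]
            for clip, delta in zip(clips, deltas)]

# Hypothetical data: 2 clips of 3-"pixel" frames with values in [0, 1].
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
clips = [
    [[0.6, 0.2, 0.4], [0.6, 0.3, 0.4]],
    [[0.1, 0.9, 0.5], [0.2, 0.9, 0.5], [0.2, 0.8, 0.5]],
]
deltas = [pgd_keyframe(W, clip[0]) for clip in clips]  # key frame = first frame (assumption)
adv_video = propagate(clips, deltas)
```

Because every frame in a clip receives the same perturbation, a target model that samples frames other than the chosen key frames still sees perturbed input, which is the role the caption assigns to perturbation propagation.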