FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs

Jinmin Li, Kuofeng Gao, Yang Bai, Jingyun Zhang, Shu-tao Xia, Yisen Wang

TL;DR

FMM-Attack is the first adversarial attack tailored for video-based LLMs. It crafts flow-based multi-modal adversarial perturbations on a small fraction of the frames within a video, and these imperceptible perturbations effectively induce video-based LLMs to generate incorrect answers.

Abstract

Despite the remarkable performance of video-based large language models (LLMs), their vulnerability to adversarial attacks remains unexplored. To fill this gap, we propose FMM-Attack, the first adversarial attack tailored for video-based LLMs, which crafts flow-based multi-modal adversarial perturbations on a small fraction of the frames within a video. Extensive experiments show that our attack effectively induces video-based LLMs to generate incorrect answers when imperceptible adversarial perturbations are added to the videos. Intriguingly, FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate. Overall, our observations inspire a further understanding of multi-modal robustness and of safety-related feature alignment across different modalities, which is of great importance for various large multi-modal models. Our code is available at https://github.com/THU-Kingmin/FMM-Attack.

Paper Structure (22 sections, 7 equations, 11 figures, 8 tables, 2 algorithms)


Figures (11)

  • Figure 1: Visualization of the transmission across video and LLM features. In Fig. (a), when attacking in the video feature space, the clustering of garbled video features can produce garbled clusters in the LLM features. In Fig. (b), when attacking in the LLM feature space, garbled videos barely form clusters in the LLM features, let alone in the video features. This illustrates the asymmetric transmission between video and LLM features.
  • Figure 2: Schematics of our FMM-Attack. The figure demonstrates our attack approach for maximizing the video-video features as defined in Eq. \ref{loss:video} and the LLM-LLM features in Eq. \ref{loss:LLM}. We refer to adversarial examples generated by our attack strategies as $\hat{\mathbf{X}} = \mathbf{X} + \Delta$, with $\Delta$ representing the adversarial perturbation.
  • Figure 3: Perturbed videos generated by Video-ChatGPT.
  • Figure 4: Relationship between optical flow and key frames. 'Clip Score of Adjacent Frames' describes the similarity between the current frame and its adjacent frames; the smaller this score, the more the current frame differs from its neighbors. 'Clip Score of Answer and Current Frame' indicates the similarity between the current frame and the answer to the user's input question; the larger this score, the more information about the answer the current frame contains. The frames selected by the flow-based masks in our FMM-Attack are key frames in the video.
  • Figure 5: Comparison of different types of attacks on the garbling rate. Max Modify denotes the maximum pixel value that can be modified, while the Garble Rate represents the percentage of responses that are garbled.
  • ...and 6 more figures
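
The captions above describe adversarial examples of the form $\hat{\mathbf{X}} = \mathbf{X} + \Delta$, a flow-based mask that restricts the perturbation to a few key frames, and a "Max Modify" bound on how far any pixel may change. The following is a minimal sketch of that masking and bounding step only, not the authors' implementation. The function names `select_key_frames` and `apply_perturbation`, the `fraction` and `max_modify` defaults, and the use of the mean absolute frame difference as a cheap stand-in for optical flow are all assumptions made for illustration; the optimization of $\Delta$ against the video and LLM feature losses is not shown.

```python
import numpy as np


def select_key_frames(video: np.ndarray, fraction: float = 0.2) -> np.ndarray:
    """Return a boolean mask of shape (T,) marking the frames with the most motion.

    video: float array of shape (T, H, W, C) with values in [0, 1].
    Motion is approximated by the mean absolute difference to the previous
    frame (an assumed stand-in for an optical-flow magnitude).
    """
    num_frames = video.shape[0]
    diffs = np.abs(video[1:] - video[:-1]).mean(axis=(1, 2, 3))
    motion = np.concatenate([[0.0], diffs])  # frame 0 has no predecessor
    k = max(1, int(round(fraction * num_frames)))
    top_k = np.argsort(motion)[-k:]  # indices of the k largest motion scores
    mask = np.zeros(num_frames, dtype=bool)
    mask[top_k] = True
    return mask


def apply_perturbation(
    video: np.ndarray,
    delta: np.ndarray,
    mask: np.ndarray,
    max_modify: float = 16 / 255,
) -> np.ndarray:
    """Compute X_hat = X + Delta, with Delta bounded and zeroed on unselected frames."""
    delta = np.clip(delta, -max_modify, max_modify)  # per-pixel bound
    delta = delta * mask[:, None, None, None]        # perturb selected frames only
    return np.clip(video + delta, 0.0, 1.0)          # keep valid pixel range


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((16, 8, 8, 3))
    frame_mask = select_key_frames(clip, fraction=0.25)
    noise = rng.normal(scale=0.1, size=clip.shape)
    adversarial = apply_perturbation(clip, noise, frame_mask)
    print("perturbed frames:", np.flatnonzero(frame_mask))
```

In a real attack, the random `noise` above would be replaced by a perturbation updated iteratively from the gradients of the feature losses, with the same mask and bound re-applied after every step.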