FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs

Jinmin Li; Kuofeng Gao; Yang Bai; Jingyun Zhang; Shu-tao Xia; Yisen Wang

TL;DR

FMM-Attack is the first adversarial attack tailored for video-based LLMs. It crafts flow-based multi-modal adversarial perturbations on a small fraction of the frames within a video, and these imperceptible perturbations effectively induce video-based LLMs to generate incorrect answers.

Abstract

Despite the remarkable performance of video-based large language models (LLMs), their adversarial threat remains unexplored. To fill this gap, we propose FMM-Attack, the first adversarial attack tailored for video-based LLMs, which crafts flow-based multi-modal adversarial perturbations on a small fraction of the frames within a video. Extensive experiments show that our attack can effectively induce video-based LLMs to generate incorrect answers when imperceptible adversarial perturbations are added to videos. Intriguingly, our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate. Overall, our observations inspire a further understanding of multi-modal robustness and of safety-related feature alignment across different modalities, which is of great importance for various large multi-modal models. Our code is available at https://github.com/THU-Kingmin/FMM-Attack.

Paper Structure (22 sections, 7 equations, 11 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Video-based Large Language Models
Adversarial Attack
Methodology
Threat model
Preliminary: the Pipeline of Video-based LLMs
Problem Formulation
Optimization Objective
Experiments
Implementation Details
Main Results
Discussions
Ablation Studies
Conclusion
...and 7 more sections

Figures (11)

Figure 1: Visualization of the transmission across video and LLM features. In Fig. (a), when attacking in the video feature space, the clustering of garbled video features can produce garbled clusters in the LLM features. In Fig. (b), when attacking in the LLM feature space, garbled videos barely form clusters in the LLM features, let alone in the video features. This illustrates the asymmetric transmission between video and LLM features.
Figure 2: Schematics of our FMM-Attack. The figure demonstrates our attack approach for maximizing the video-video features as defined in Eq. \ref{loss:video} and the LLM-LLM features in Eq. \ref{loss:LLM}. We refer to adversarial examples generated by our attack strategies as $\hat{\mathbf{X}} = \mathbf{X} + \Delta$, with $\Delta$ representing the adversarial perturbation.
Figure 3: Perturbed videos generated by Video-ChatGPT.
Figure 4: Relationship between optical flow and key frames. 'Clip Score of Adjacent Frames' describes the similarity between the current frame and its adjacent frames; the smaller this score, the more the current frame differs from its neighbors. 'Clip Score of Answer and Current Frame' indicates the similarity between the current frame and the answer to the user's input question; the larger this score, the more information about the answer the current frame contains. The frames selected by flow-based masks in our FMM-Attack are key frames in the video.
Figure 5: Comparison of different types of attacks on the garbling rate. Max Modify denotes the maximum pixel value that can be modified, while the Garble Rate represents the percentage of responses that are garbled.
...and 6 more figures
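
The notation in the Figure 2 caption ($\hat{\mathbf{X}} = \mathbf{X} + \Delta$) and the 'Max Modify' bound in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the array layout (frames, height, width, channels), the 0 to 255 pixel range, and the top-k selection rule are all assumptions, and the frame selection uses a simple mean frame difference as a stand-in for a real optical-flow estimate. The perturbation here is random noise rather than an optimized one.

```python
import numpy as np


def select_frames_by_motion(video, fraction):
    """Return a boolean mask over frames, keeping the highest-motion ones.

    Hypothetical stand-in for a flow-based mask: motion is approximated by
    the mean absolute difference to the previous frame, not by optical flow.
    """
    diffs = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2, 3))
    motion = np.concatenate([[0.0], diffs])  # first frame has no predecessor
    k = max(1, int(round(fraction * len(video))))
    mask = np.zeros(len(video), dtype=bool)
    mask[np.argsort(motion)[-k:]] = True
    return mask


def apply_sparse_perturbation(video, delta, frame_mask, max_modify):
    """Compute X_hat = X + Delta with Delta bounded and limited to masked frames."""
    delta = np.clip(delta, -max_modify, max_modify)   # per-pixel bound
    delta = delta * frame_mask[:, None, None, None]   # zero out unselected frames
    return np.clip(video + delta, 0.0, 255.0)         # stay in the valid pixel range


rng = np.random.default_rng(0)
video = rng.uniform(0.0, 255.0, size=(10, 4, 4, 3))   # toy clip of 10 tiny frames
delta = rng.normal(0.0, 20.0, size=video.shape)       # random, not optimized
frame_mask = select_frames_by_motion(video, fraction=0.2)
x_hat = apply_sparse_perturbation(video, delta, frame_mask, max_modify=8.0)
```

In an actual attack, the perturbation would be updated iteratively against the video-feature and LLM-feature objectives that the Figure 2 caption refers to; the sketch only shows how a bounded, frame-sparse perturbation is combined with the clean video.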
