On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

TL;DR

This work demonstrates that a single perturbation can hijack the entire decision chain of visual multimodal LLMs by exploiting input semantics. It introduces Semantic-Aware Universal Perturbations (SAUPs) and the SORT optimization framework, which operate in a normalized latent space and employ semantic separation to produce targeted outputs conditioned on semantics. A new Real Image Sequence Trajectories (RIST) dataset enables evaluation of fine-grained semantic control, and extensive experiments across Llava, Qwen, and InternVL show high attack success rates, including up to 93% for small target sets and substantial performance with larger target sets. The findings reveal a fundamental vulnerability in MLLMs to semantically aligned, universal perturbations and motivate future defenses and robust design for safe deployment in sequential decision-making tasks.

Abstract

Conventional adversarial attacks focus on manipulating a single decision of a neural network. However, real-world models often make a sequence of decisions, where an isolated mistake can be easily corrected but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag". To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce different outcomes depending on the semantics of the inputs. We overcome the optimization challenges by developing an effective algorithm that searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability: using only an adversarial frame, the attack achieves a 70% success rate when controlling five distinct targets.
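
To make the notion of an "adversarial frame" concrete, the following is a minimal sketch of how one shared border perturbation could be pasted onto different images while leaving their interiors, and thus their semantics, untouched. It is an illustration under stated assumptions, not the paper's implementation: the helper name `apply_frame`, the border width, the `[0, 1]` pixel range, and the random frame values are all hypothetical, and the optimization that would actually train the frame is not shown.

```python
import numpy as np


def apply_frame(image: np.ndarray, frame: np.ndarray, width: int) -> np.ndarray:
    """Return a copy of `image` whose outer border of `width` pixels is
    replaced by the corresponding pixels of `frame` (hypothetical helper)."""
    if image.shape != frame.shape:
        raise ValueError("image and frame must have the same shape")
    h, w = image.shape[:2]
    if not 0 < 2 * width < min(h, w):
        raise ValueError("width must leave a non-empty interior")

    # Boolean mask that is True on the border and False in the interior.
    border = np.ones((h, w), dtype=bool)
    border[width:h - width, width:w - width] = False

    out = image.copy()           # do not modify the caller's image
    out[border] = frame[border]  # overwrite only the border pixels
    return out


# Assumed setup: float images in [0, 1] with shape (H, W, 3).
rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))   # stand-in for a trained perturbation
img_a = rng.random((224, 224, 3))   # e.g. an image of one semantic class
img_b = rng.random((224, 224, 3))   # e.g. an image of another class

adv_a = apply_frame(img_a, frame, width=16)
adv_b = apply_frame(img_b, frame, width=16)
```

Both perturbed images carry the same border, so any difference in the model's output between them would have to come from the preserved interior content, which is the property the semantic-aware setting relies on.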

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of hijacking an MLLM's decision chain. The adversary pastes the perturbation onto the camera. Once the vehicle starts and perceives images, the MLLM outputs exactly the attacker's predefined target content based on the image semantics, cumulatively guiding the vehicle to a predefined destination.
  • Figure 2: Illustration of SAUPs' workflow. The adversary first collects images from several classes and assigns a specific target label to each class. These images are then used to train the adversarial perturbation (e.g., an adversarial frame). Once trained, the perturbation can be applied to other unseen images, causing MLLMs to generate the exact target sentences conditioned on the semantic content of the input image.
  • Figure 3: Illustration of RIST, which consists of real-world image trajectories across two scenarios: AutoDriving and RoboTasking. RIST clusters semantically similar frames as the same class, with each class containing 10 images.
  • Figure 4: We generate a SAUP for Llava and extract image features from the penultimate layer. (a) The perturbation dominates the features of the perturbed images and deviates significantly from those of the clean images. (b) The perturbed image features corresponding to different semantics are separated from each other and occupy distinct locations in the feature space. (c) The perturbed images are aligned with their corresponding targets, resulting in high output confidence for those targets.
  • Figure 5: Illustration of the underlying mechanism of the Semantic-Aware phenomenon. Given three image sets $\mathcal{V}_0, \mathcal{V}_1, \mathcal{V}_2$ with distinct semantic contents and an all-zero pixel image serving as an anchor, the SAUP maps all image features to a distant region in the latent space, while the respective semantic directions (SDs) of $\mathcal{V}_0, \mathcal{V}_1, \mathcal{V}_2$ cause slight deflections in these features, guiding them toward alignment with the predefined targets.
  • ...and 5 more figures