On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation
Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He
TL;DR
This work demonstrates that a single perturbation can hijack the entire decision chain of visual multimodal LLMs by exploiting input semantics. It introduces Semantic-Aware Universal Perturbations (SAUPs) and the SORT optimization framework, which searches in a normalized latent space and uses semantic separation so that the model's output depends on the semantics of each input. A new Real Image Sequence Trajectories (RIST) dataset enables evaluation of fine-grained semantic control, and extensive experiments across LLaVA, Qwen, and InternVL show high attack success rates, including up to 93% for small target sets and substantial performance with larger target sets. The findings reveal a fundamental vulnerability of MLLMs to semantically aligned, universal perturbations and motivate future defenses and robust design for safe deployment in sequential decision-making tasks.
Abstract
Conventional adversarial attacks focus on manipulating a single decision of a neural network. However, real-world models often make a sequence of decisions, where an isolated mistake can be easily corrected but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple, predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag". To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm that searches for perturbations in a normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability: SAUPs achieve a 70% attack success rate when controlling five distinct targets using only an adversarial frame.
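As a rough illustration of what "one perturbation, multiple semantics-dependent targets" could mean formally, the objective might be sketched as below. This is an assumed formalization for the reader's orientation, not the formulation used in the paper. All symbols are hypothetical: f is the model, K the number of semantic groups, \mathcal{D}_k the distribution of inputs with semantic k, t_k the predefined target output for that semantic, \oplus the operation that applies the perturbation to an input (for example, as a frame), \Delta the set of allowed perturbations, and \mathcal{L} a loss such as cross-entropy on the target output.

```latex
\delta^{\star}
  \;=\; \arg\min_{\delta \in \Delta}\;
  \sum_{k=1}^{K}\;
  \mathbb{E}_{x \sim \mathcal{D}_k}
  \Big[\, \mathcal{L}\big( f(x \oplus \delta),\; t_k \big) \Big]
```

Under these assumptions, K = 1 reduces to an ordinary targeted universal perturbation, while K > 1 requires the same \delta to steer inputs of different semantics to different targets. The sketch does not capture the normalized-space search or the semantic separation strategy, which the abstract names but does not detail.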
