ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang, Haoran Li, Huiping Zhuang, Ruidong Wang, Cen Chen, Hao Peng

TL;DR

The paper tackles multimodal indirect prompt injection in multimodal LLMs by identifying a safety subspace in the model's activation space where instruction-following behaviors can be steered. It introduces ARGUS, a three-stage defense comprising Injection Detection, Adaptive Activation Steering with a top-N-layer search, and Post-filtering to verify defense success, all designed to preserve user utility while suppressing injected instructions. Extensive cross-modal benchmarks and experiments demonstrate ARGUS achieving near-zero safety violations with minimal latency, outperforming existing baselines across image, video, and audio modalities. The work provides a practical, modality-agnostic approach to defending MLLMs from unseen multimodal IPI attacks and outlines directions for further generalization and robustness improvements.

Abstract

Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also find that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that is decoupled from the utility-degrading direction, and combines it with adaptive-strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection detection stage to activate the defense on demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility.
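The two ideas in the abstract, decoupling a defense direction from a utility-degrading direction and applying only as much steering as needed, can be illustrated with a toy sketch. This is not the ARGUS procedure: the paper searches for a direction within a safety subspace, whereas the sketch substitutes a single Gram-Schmidt projection step, and the strength rule (push the projection up to a fixed `target`) is an assumption made purely for illustration. All function names, vectors, and the `target` value are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def decouple(defense_dir, utility_dir):
    """Remove the component of defense_dir along utility_dir.

    One Gram-Schmidt step: a stand-in (assumption) for the paper's
    search for a defense direction decoupled from utility degradation.
    """
    u = normalize(utility_dir)
    coeff = dot(defense_dir, u)
    return normalize([d - coeff * x for d, x in zip(defense_dir, u)])


def adaptive_steer(hidden, direction, target):
    """Shift hidden along direction just far enough to reach target.

    The strength is zero when the projection already meets the target,
    so states that need no correction are left untouched. This rule is
    hypothetical, not taken from the paper.
    """
    strength = max(0.0, target - dot(hidden, direction))
    steered = [h + strength * d for h, d in zip(hidden, direction)]
    return steered, strength


# Toy 3-dimensional values, made up for illustration.
naive_defense = [1.0, 1.0, 0.0]  # hypothetical naive defense direction
utility_harm = [0.0, 1.0, 0.0]   # hypothetical utility-degrading direction
direction = decouple(naive_defense, utility_harm)

hidden = [0.2, 0.5, -0.3]        # hypothetical hidden state at one layer
steered, strength = adaptive_steer(hidden, direction, target=1.0)
print(direction, strength, steered)
```

In this toy setting the steered state keeps its component along the utility-degrading direction unchanged, which is the intuition behind decoupling: the intervention moves the representation only along the part of the defense direction that does not overlap with utility loss.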

Paper Structure

This paper contains 25 sections, 15 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The validation accuracy of probes. "Probe" refers to the probe trained without constraints. "Orthogonal Probe 1" has weights orthogonal to those of "Probe", and "Orthogonal Probe 2" has weights orthogonal to those of both other probes.
  • Figure 2: The validation results of inference-time intervention. "No Steering" refers to the original performance without any steering applied. The "Performance Upper Bound" refers to the model's performance when the input consists of a single instruction. For the AIA metric, it reflects the model's performance when the user instructions are removed, while for the UIA metric, it measures the model's performance when the input does not contain an injection.
  • Figure 3: The overall framework of ARGUS. It includes three defense stages: injection detection, activation steering, and post-filtering.
  • Figure 4: The prompt templates used for GLUE tasks, where [Sentence1] and [Sentence2] serve as placeholders.
  • Figure 5: The prompts used for the WAN-2.1-VACE-1.3B model in the Removal baseline.
  • ...and 1 more figure