Table of Contents
Fetching ...

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim

TL;DR

This work proposes ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity, and demonstrates robust Out of Distribution(OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.

Abstract

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulate adversarial query key dependencies while amplifying task relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial grade detectors, while yielding a over 50% task utility gain. Furthermore, ICON demonstrates robust Out of Distribution(OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

TL;DR

This work proposes ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity, and demonstrates robust Out of Distribution(OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.

Abstract

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulate adversarial query key dependencies while amplifying task relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial grade detectors, while yielding a over 50% task utility gain. Furthermore, ICON demonstrates robust Out of Distribution(OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.
Paper Structure (22 sections, 10 equations, 3 figures, 5 tables)

This paper contains 22 sections, 10 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Our defense, ICON, presents a better security-utility trade-off (lower attack success rate, meanwhile higher utility) than other defenses against adaptive indirect prompt injection attacks.
  • Figure 2: The overall architecture of ICON. The framework comprises two core modules: (1) The Latent Space Trace Prober for identifying intrinsic anomalies in the model's internal representations, and (2) The Mitigating Rectifier for neutralizing threats and restoring functional integrity. Together, these mechanisms enable robust detection and effective rectification of indirect prompt injections.
  • Figure 3: Visualization of Mitigate Rectifier's Functionality