CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim; Mihir Parmar; Phillip Wallis; Lesly Miculicich; Kyomin Jung; Krishnamurthy Dj Dvijotham; Long T. Le; Tomas Pfister

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister

TL;DR

CausalArmor addresses Indirect Prompt Injection in tool-using LLMs by detecting a causal dominance shift at privileged decisions using Leave-One-Out attribution. It introduces a selective guardrail that triggers expensive sanitization only when an untrusted span dominates user intent, combined with retroactive CoT masking to erase poisoned reasoning traces. The framework leverages batched proxy-model inference and token-length normalization to keep latency low while maintaining high security, achieving near-zero ASR on AgentDojo and DoomArena and preserving Benign Utility close to the No Defense baseline. The approach is theoretically grounded, with bounds showing exponential suppression of malicious actions under two reasonable assumptions, and empirically validated across strong benchmarks. This work offers an interpretable, efficient alternative to always-on sanitization, suitable for practical deployment in real-world, untrusted-content environments.

Abstract

AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on ``poisoned'' reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

TL;DR

Abstract

Paper Structure (85 sections, 1 theorem, 23 equations, 9 figures, 4 tables, 1 algorithm)

This paper contains 85 sections, 1 theorem, 23 equations, 9 figures, 4 tables, 1 algorithm.

Introduction
Key Idea: detect dominance shift at privileged decisions.
CausalArmor.
Empirical Results.
Contributions.
Related Work
Indirect Prompt Injection in Tool-Using Agents
Defenses Against IPI and Over-Defense Dilemmas
Prompting.
Trained Classifier.
System-based defenses.
Where CausalArmor differs.
Attribution and Counterfactual Influence
Indirect Prompt Injection: Setup and Formalization
Notation and agentic setting
...and 70 more sections

Key Result

Proposition 4.1

Under Eq. eq:beta and Eq. eq:eff_def, the probability that an episode of length $T$ executes any malicious privileged action is bounded by:

Figures (9)

Figure 1: Overview of the CausalArmor Workflow. When a privileged action is proposed, CausalArmor computes the attributable influence ($\Delta$) of the User Request versus Untrusted Spans. Under a successful IPI attack, the untrusted span overtakes the user request in driving the model's output ($\Delta_S \gg \Delta_U$), exhibiting a Dominance Shift. Detecting this signature allows the system to selectively intervene (Path B) by sanitizing the specific trigger and masking the reasoning history, while normal requests proceed without overhead (Path A).
Figure 2: Causal attribution signature of IPI (Dominance Shift). Aggregated results on the AgentDojo benchmark. In benign scenarios, privileged decisions are grounded primarily by the user request ($\Delta_U$ dominates). Under IPI, the user grounding collapses while a specific untrusted span becomes highly dominant, producing a large margin $\Delta_S - \Delta_U$.
Figure 3: Main results on AgentDojo. Defense types: Prompting, Trained classifiers, System-based. Top (attack scenarios): Utility under attack (UA) versus attack success rate (ASR). CausalArmor reduces ASR to near zero while maintaining UA close to the No Defense baseline. Bottom (benign scenarios): Benign utility (BU) versus benign latency (BL). CausalArmor achieves strong security with low benign overhead, remaining near No Defense in both BU and BL compared to prior defenses. Detailed results are in Table \ref{['tab:main_results']}.
Figure 4: Default CausalArmor ($\tau=0$) mitigates the over-defense dilemma by achieving strong security with minimal utility and latency loss. Adjusting margin $\tau$ allows for fine-grained control over this security-general usefulness trade-off. Trade-off results on more LLM agents are on Figure \ref{['fig:tau-tradeoff-expanded']}.
Figure 5: Attack Success Rate vs. Relative Latency (normalized to Gemini-3-Pro) for various proxy models.
...and 4 more figures

Theorems & Definitions (1)

Proposition 4.1: Exponential Decay of IPI Success

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

TL;DR

Abstract

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (9)

Theorems & Definitions (1)