Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Haochuan Kevin Wang

Abstract

We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents -- 0/40 canaries survived into shared memory.
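The canary instrumentation described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the token format (SECRET-[A-F0-9]{8}) and the stage names come from the abstract, the PropagationLogger name from Figure 1, while the class interface and example observations are assumptions.

```python
import re
import secrets

# Token format and kill-chain stages are from the paper; everything else
# in this sketch is illustrative.
CANARY_RE = re.compile(r"SECRET-[A-F0-9]{8}")
STAGES = ["Exposed", "Persisted", "Relayed", "Executed"]

def make_canary() -> str:
    """Mint a fresh canary token, e.g. 'SECRET-3FA2B91C'."""
    return "SECRET-" + secrets.token_hex(4).upper()

class PropagationLogger:
    """Record which kill-chain stages a canary survives into."""
    def __init__(self, canary: str):
        self.canary = canary
        self.survived = {s: False for s in STAGES}

    def observe(self, stage: str, text: str) -> None:
        # A stage is 'reached' if the literal canary appears in its output.
        if self.canary in text:
            self.survived[stage] = True

    def deepest_stage(self):
        reached = [s for s in STAGES if self.survived[s]]
        return reached[-1] if reached else None

# A run where the summarizer strips the injection at write_memory:
canary = make_canary()
log = PropagationLogger(canary)
log.observe("Exposed", f"poisoned web page ... {canary} ...")
log.observe("Persisted", "clean summary, canary stripped")
assert CANARY_RE.fullmatch(canary)
assert log.deepest_stage() == "Exposed"
```

Under this scheme, a model's defense depth is simply the deepest stage its canaries reach, which is what the survival curves in Figure 3 plot.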

Paper Structure

This paper contains 21 sections, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Benchmark architecture. Injections enter via one of four attack surfaces and are tracked by PropagationLogger through four kill-chain stages. Agent A (summarizer) reads poisoned content and writes to MemoryStore; Agent B (executor) reads and calls outbound tools. Utility and security are evaluated independently.
  • Figure 2: (a) Benign task success vs. targeted ASR. Three qualitatively distinct regimes emerge: GPT-4o-mini (90% utility, 53% ASR); DeepSeek/GPT-5-mini (partial resistance); Claude (appears Pareto-efficient within our evaluation: 100% utility, 0% ASR). (b) Utility under attack -- most models maintain task success while being compromised, confirming the dual-completion pattern.
  • Figure 3: Kill-chain decomposition (propagation scenario). Top: per-stage survival blocks for three representative models -- GPT-4o-mini propagates all four stages at 100%; GPT-5-mini partly filters at Persisted; Claude eliminates every canary at write_memory. Bottom: canary survival curves for all five models -- the sharp drop for both Claude variants between Exposed and Persisted marks the summarization-stage defense.
  • Figure 4: ASR heatmap (model × scenario) with Wilson 95% CI per cell. The orange border (★) highlights DeepSeek Chat: 0/24 on memory_poison (three independent batches over 17 days) vs. 8/8 on tool_poison -- a 100-percentage-point swing from the same model on a different injection surface. This demonstrates that single-surface evaluation can completely mischaracterize a model's actual safety posture.
  • Figure 5: Per-step TF-IDF cosine distance from the task description (GPT-4o-mini, memory_poison). Clean (blue) and attacked (red) curves are indistinguishable through steps 1--2. Divergence appears at step 3 -- the harmful send_report call -- concurrent with, not before, the harm.
  • ...and 2 more figures
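The Figure 5 metric admits a compact sketch: the TF-IDF cosine *distance* between each agent step and the task description. This is not the paper's code; the task text, the example steps, and the simplified TF-IDF weighting (raw term frequency with log inverse document frequency) are all assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a small document list."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: in how many docs each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (c / len(toks)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine_distance(u, v):
    """1 - cosine similarity over sparse nonnegative vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

# Invented example: two on-task steps, then a divergent exfiltration step.
task = "summarize the quarterly report for the team"
steps = [
    "read the quarterly report",
    "draft summary of the report findings",
    "call send_report to external address",
]
vecs = tfidf_vectors([task] + steps)
dists = [cosine_distance(vecs[0], v) for v in vecs[1:]]
# The divergent step sits farther from the task than the on-task steps.
assert dists[2] > dists[0]
```

As Figure 5 notes, the divergence this metric surfaces appears concurrent with the harmful call rather than before it, which limits its value as an early-warning signal.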