Prompt Injection as Role Confusion

Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

Abstract

Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.

Paper Structure (103 sections, 2 equations, 32 figures, 3 tables)

Figures (32)

  • Figure 1: Text that sounds like chain-of-thought inherits its privilege. Three frontier safety models comply with otherwise unjustifiable requests because spoofed reasoning-styled text confers authority.
  • Figure 2: Attack success on StrongREJECT. Models with near-perfect defense against standard jailbreaks (pink) collapse under CoT Forgery (orange).
  • Figure 3: ASRs in an agentic data exfiltration task. Standard prompt injection (gray) largely fails; CoT Forgery (red) dramatically increases success.
  • Figure 4: Style is causal. (a) A CoT forgery and its destyled variant for a model (gpt-oss-20b). The argument is preserved; only markers of a model's characteristic reasoning style are removed. (b) The same argument, phrased differently, loses its authority.
  • Figure 5: Data construction for role probes. We embed non-instruct web text within different role tags. Content is held constant—the probe must learn the model's internal representation of role itself. Simplified role tags here for clarity; actual experiments use model-native tokens.
  • ...and 27 more figures
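
The data construction described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the role set, the angle-bracket tags, and the names `ROLE_TEMPLATES`, `ProbeExample`, and `build_probe_examples` are all assumptions made for the example. The caption itself notes that the actual experiments use model-native tokens rather than simplified tags.

```python
from dataclasses import dataclass

# Hypothetical, simplified role tags. The real experiments use each model's
# native role tokens; the set of roles listed here is also an assumption.
ROLE_TEMPLATES = {
    "system": "<system>{text}</system>",
    "user": "<user>{text}</user>",
    "assistant": "<assistant>{text}</assistant>",
    "tool": "<tool>{text}</tool>",
}


@dataclass(frozen=True)
class ProbeExample:
    prompt: str   # the passage wrapped in one role tag
    role: str     # the label a role probe would be trained to predict
    text_id: int  # index of the underlying passage


def build_probe_examples(texts, templates=ROLE_TEMPLATES):
    """Wrap every passage in every role tag.

    Each passage appears once per role, so content is held constant across
    labels and only the role tag distinguishes the examples.
    """
    examples = []
    for text_id, text in enumerate(texts):
        for role, template in templates.items():
            prompt = template.format(text=text)
            examples.append(ProbeExample(prompt, role, text_id))
    return examples
```

Because every passage is paired with every role, a probe trained on the model's activations for these prompts cannot separate the classes by content alone. Extracting those activations and fitting the probe are outside the scope of this sketch.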