Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramer

TL;DR

Prompt injection poses a critical security risk for LLM-driven agents that interact with external content. AutoInject formulates attack generation as a reinforcement learning problem, using dense, comparison-based rewards and a compact 1.5B-parameter suffix generator to optimize for injection success while preserving benign task utility. It supports both online query-based attacks and universal suffixes that transfer to unseen models and tasks. On the AgentDojo benchmark it outperforms template-based and optimization-based baselines and reveals transferable attack patterns. The findings highlight gaps in current defenses and underscore the need for robust, adaptive red-teaming and defense strategies against utility-preserving, automated prompt injections in agentic AI systems.

Abstract

Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, which limits their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
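
The joint objective described above (attack success, utility preservation, and a dense comparison-based signal) can be illustrated with a small scoring sketch. This page does not state how the terms are combined, so the weighted sum, the weight values, the `[0, 1]` score range, and the names `RewardWeights` and `combined_reward` are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the source does not say how the terms are combined.
    sec: float = 1.0
    util: float = 1.0
    pref: float = 0.5


def combined_reward(
    r_sec: float,
    r_util: float,
    r_pref: float,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of a security, a utility, and a comparison score.

    Assumes every score has already been scaled to [0, 1].
    """
    for name, value in (("r_sec", r_sec), ("r_util", r_util), ("r_pref", r_pref)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return w.sec * r_sec + w.util * r_util + w.pref * r_pref
```

Under these assumptions, two candidates that both fail the sparse checks (`r_sec = r_util = 0`) still receive different rewards when their comparison scores differ, which is the sense in which a comparison term makes the signal dense.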

Paper Structure

This paper contains 53 sections, 5 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: Attack success rate versus utility under attack on AgentDojo. AutoInject learns adversarial suffixes that succeed on the attacker's injection task while preserving high utility on the user's original benign tasks.
  • Figure 2: Overview of the prompt injection pipeline with reinforcement learning. In addition to the security score $r^{\text{sec}}$ and the utility score $r^{\text{util}}$, a feedback model generates a comparison score $r^{\text{pref}}$ that provides a dense reward signal. Many of the generated prompts remain effective when transferred to other LLMs and tasks.
  • Figure 3: Transfer attack success rates (ASR) across model pairs. Attack suffixes optimized on source models (columns) are evaluated against target models (rows). (a) Suffixes from a single (user task, injection task) pair are transferred to 8 randomly sampled task pairs across suites. (b) Suffixes are transferred across all banking user tasks with injection task 4 held fixed. GPT-4o mini is the most vulnerable, while GPT-5 is the most robust against transfer attacks.
  • Figure 4: Comparison with a search-based adaptive attack method, removing the effect of LLM policy.
  • Figure 5: Transfer attack success rates with multi-task suffix training. Suffixes are trained on injection task 4 with user tasks 0-4 of the banking suite, then transferred across all banking user tasks with injection task 4 held fixed. Multi-task training increases transfer success for some pairs (e.g., Gemini 2.0 Flash $\rightarrow$ GPT-4o mini: 68.8% ASR) but does not uniformly improve transferability.
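
The transfer matrices in Figures 3 and 5 report one ASR value per (target, source) cell. A minimal sketch of how such a table could be tallied from per-trial outcomes is given below; the function name `transfer_asr`, the tuple layout of a trial, and the placeholder model names in the usage example are hypothetical and do not come from the paper or from AgentDojo.

```python
from collections import defaultdict


def transfer_asr(trials):
    """Return ASR in percent for each (target, source) pair.

    `trials` is an iterable of (source_model, target_model, success) tuples,
    where `success` is True if the injected task was carried out.
    """
    counts = defaultdict(lambda: [0, 0])  # (target, source) -> [successes, total]
    for source, target, success in trials:
        cell = counts[(target, source)]
        cell[0] += int(success)
        cell[1] += 1
    return {pair: 100.0 * wins / total for pair, (wins, total) in counts.items()}


# Usage with made-up outcomes (not results from the paper):
example = transfer_asr([
    ("model_a", "model_b", True),
    ("model_a", "model_b", False),
    ("model_a", "model_a", True),
])
```

Keying cells by (target, source) mirrors the layout described in the Figure 3 caption, where rows are target models and columns are source models.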