Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen; Jie Zhang; Florian Tramer

TL;DR

Prompt injection is a critical security risk for LLM-driven agents that interact with external content. AutoInject formulates attack generation as a reinforcement learning problem: a compact 1.5B-parameter suffix generator is trained with dense, comparison-based rewards to maximize injection success while preserving utility on the user's benign task. It supports both online query-based attacks and universal suffixes that transfer to unseen models and tasks, and it outperforms template-based and optimization-based baselines on the AgentDojo benchmark. The results expose gaps in current defenses and underscore the need for adaptive red-teaming and for defenses against automated, utility-preserving prompt injection in agentic AI systems.

Abstract

Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, which limits their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
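As a rough illustration of what jointly optimizing for attack success and utility preservation can look like, the sketch below folds three per-episode signals into one scalar reward. The weighted-sum form, the weight values, and the names `RewardWeights` and `combined_reward` are assumptions made for illustration; the abstract does not state how the terms are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights, not taken from the paper.
    security: float = 1.0
    utility: float = 0.5
    preference: float = 0.25


def combined_reward(r_sec: float, r_util: float, r_pref: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the three signals (an assumed combination rule).

    r_sec:  1.0 if the injected task was carried out, else 0.0.
    r_util: 1.0 if the user's benign task still succeeded, else 0.0.
    r_pref: dense comparison score in [0, 1] from a feedback model.
    """
    for name, value in (("r_sec", r_sec), ("r_util", r_util), ("r_pref", r_pref)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return w.security * r_sec + w.utility * r_util + w.preference * r_pref
```

Under this assumed form, the preference term is what keeps the signal dense: two attempts that both fail the injection task (`r_sec = 0`) can still be ranked when the feedback model scores one higher than the other.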

Paper Structure (53 sections, 5 equations, 5 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Prompt injection attacks.
Automated and optimization-based adversarial attacks.
RL-based optimization for prompt injection.
Method
Problem Formulation
Dense Reward via Learned Feedback
Policy Optimization
Training procedure.
Experiments
Experimental Setup
Benchmark.
Attack generator.
Attack baselines.
...and 38 more sections

Figures (5)

Figure 1: Attack success rate versus utility under attack on AgentDojo. AutoInject learns adversarial suffixes that succeed on the attacker's injection task while preserving high utility on the user's original benign tasks.
Figure 2: Overview of the prompt injection pipeline with reinforcement learning. In addition to the security score $r^\text{sec}$ and the utility score $r^\text{util}$, a feedback model generates a comparison score $r^\text{pref}$ that provides a dense reward signal. Many of the generated prompts remain effective when transferred to other LLMs and tasks.
Figure 3: Transfer attack success rates (ASR) across model pairs. Attack suffixes optimized on source models (columns) are evaluated against target models (rows). (a) Suffixes from a single (user task, injection task) pair transferred to 8 randomly sampled task pairs across suites. (b) Suffixes transferred across all banking user tasks with fixed injection task 4. GPT-4o mini is the most vulnerable, while GPT-5 is the most robust against transfer attacks.
Figure 4: Comparison with a search-based adaptive attack method, removing the effect of LLM policy.
Figure 5: Transfer attack success rates with multi-task suffix training. Suffixes are trained on injection task 4 and user tasks 0-4 of the banking suite, then transferred across all banking user tasks with fixed injection task 4. Multi-task training increases transfer success for some pairs (e.g., Gemini 2.0 Flash $\rightarrow$ GPT-4o mini: 68.8% ASR) but does not uniformly improve transferability.
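
The transfer matrices in Figures 3 and 5 report attack success rate (ASR) for each source/target model pair. A minimal sketch of that bookkeeping follows, assuming per-trial outcomes are recorded as (source model, target model, success) tuples. The record format and the function name `transfer_asr` are hypothetical, and the example data is made up rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Trial = Tuple[str, str, bool]  # (source model, target model, attack succeeded)


def transfer_asr(trials: Iterable[Trial]) -> Dict[Tuple[str, str], float]:
    """Return ASR (fraction of successful trials) per (source, target) pair."""
    wins: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for source, target, success in trials:
        totals[(source, target)] += 1
        wins[(source, target)] += int(success)
    return {pair: wins[pair] / totals[pair] for pair in totals}


# Made-up outcomes for two placeholder models, for illustration only.
example = [
    ("model-a", "model-b", True),
    ("model-a", "model-b", False),
    ("model-b", "model-a", True),
    ("model-b", "model-a", True),
]
asr = transfer_asr(example)
```

Each entry of the resulting dictionary corresponds to one cell of a transfer matrix, with the source model selecting the column and the target model selecting the row.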
