Learning to Inject: Automated Prompt Injection via Reinforcement Learning
Xin Chen, Jie Zhang, Florian Tramèr
TL;DR
Prompt injection is a critical security risk for LLM-driven agents that interact with external content. AutoInject formulates attack generation as a reinforcement learning problem, using dense, comparison-based rewards and a compact 1.5B-parameter suffix generator to optimize for injection success while preserving benign task utility. It supports both online query-based attacks and universal suffixes that transfer to unseen models and tasks. On the AgentDojo benchmark it outperforms template-based and optimization-based baselines and reveals transferable attack patterns. The findings expose gaps in current defenses and underscore the need for adaptive red-teaming and defense strategies against automated, utility-preserving prompt injection in agentic AI systems.
Abstract
Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods depend heavily on human red-teamers and hand-crafted prompts, which limits their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B-parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
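
To make the idea of jointly optimizing attack success and utility preservation concrete, the following is a minimal illustrative sketch of a scalar reward that combines the two objectives. It is not the reward used by AutoInject: the `EpisodeOutcome` record, the `joint_reward` function, and the `utility_weight` parameter are all hypothetical names introduced here, and the dense, comparison-based signal mentioned in the TL;DR is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Result of one agent run with an injected suffix (hypothetical record)."""
    injection_succeeded: bool   # did the agent carry out the attacker's goal?
    user_task_completed: bool   # did the agent still finish the benign task?


def joint_reward(outcome: EpisodeOutcome, utility_weight: float = 0.5) -> float:
    """Combine attack success and utility preservation into one scalar.

    Assumption: a simple weighted sum. The weight is illustrative only and
    is not taken from the paper.
    """
    attack_term = 1.0 if outcome.injection_succeeded else 0.0
    utility_term = 1.0 if outcome.user_task_completed else 0.0
    return attack_term + utility_weight * utility_term
```

Under this sketch, a suffix that hijacks the agent while leaving the user's task intact scores higher than one that succeeds only by breaking the benign task, which is the trade-off the abstract describes.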
