Lessons from Defending Gemini Against Indirect Prompt Injections
Chongyang Shi, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson, Andreas Terzis, John "Four" Flynn
TL;DR
This work investigates indirect prompt injections against Gemini under a threat model in which untrusted retrieved data can steer a tool-using agent into exfiltrating private information. It introduces an automated, adaptive red-teaming framework with four attack families (Actor Critic, Beam Search, TAP, Linear Generation), synthetic datasets, and an evaluation of in-context and classification-based defenses against both non-adaptive and adaptive attacks. Key findings: adaptive evaluation is essential, defenses used in isolation are insufficient, and adversarial fine-tuning combined with system-level defenses yields substantial robustness gains, though no solution is foolproof. The study advocates defense-in-depth and ongoing, multi-modal evaluation so that agentic AI can be deployed responsibly while mitigating indirect prompt injection risks.
Abstract
Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data, which introduces risk. Adversaries can embed malicious instructions in untrusted data that cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.
