Lessons from Defending Gemini Against Indirect Prompt Injections

Chongyang Shi, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson, Andreas Terzis, John "Four" Flynn

TL;DR

This work investigates indirect prompt injections in Gemini under a threat model in which untrusted retrieved data can steer tool-using agents to exfiltrate private information. It introduces an automated, adaptive red-teaming framework with four attack families (Actor Critic, Beam Search, TAP, Linear Generation), synthetic datasets, and an evaluation of in-context and classification defenses under both non-adaptive and adaptive attacks. Key findings show that adaptive evaluation is essential, that defenses in isolation are insufficient, and that adversarial fine-tuning combined with system-level defenses yields substantial robustness gains, though no solution is foolproof. The study advocates defense-in-depth and ongoing, multi-modal evaluation to deploy agentic AI responsibly while mitigating indirect prompt injection risks.

Abstract

Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data, which introduces risk. Adversaries can embed malicious instructions in untrusted data that cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.

Paper Structure

This paper contains 102 sections, 5 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: An example of an indirect prompt injection attack.
  • Figure 2: High-level design of the four attacks we use for automated evaluation of indirect prompt injection.
  • Figure 3: A shortened sample from our dataset in the email and passport scenario (the full version is included at \ref{sec:fullexample}). Each trigger is generated for a specific scenario which, in this case, uses email as the function-calling capability (highlighted in yellow) and a passport number as the private data type (in the non-JSON format); see \ref{ssec:scenarios}. Each sample contains a different synthetically generated conversation history with a different private data instance ("E70034442" above) of the corresponding type. The trigger is injected in the form of content retrieved by the first (legitimate) function call invoked by the user query. Gemini 2.0 (undefended) is often tricked into exfiltrating the private data by invoking another (malicious) function call. By contrast, Gemini 2.5 (adversarially fine-tuned) recognizes the potential attack and warns the user.
  • Figure 4: Attack success rate (ASR) against the number of queries to the target model (an undefended version of Gemini 2.0) to construct the trigger. We plot results against each of the three information types, two trigger formats, and two functions described in \ref{ssec:scenarios}, constructed using each of the three main attacks described in \ref{ssec:adversarialtechniques_attacks}.
  • Figure 5: Attack success rate (ASR) over different defenses under non-adaptive evaluations. For each attack (\ref{ssec:adversarialtechniques_attacks}) and defense, we plot the peak ASR achieved over all generated triggers. We also plot the number of queries to (undefended) Gemini 2.0 necessary to achieve this peak ASR during attack optimization.
  • ...and 5 more figures
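
The Figure 5 caption reports, for each attack and defense, the peak ASR over all generated triggers. A minimal sketch of that aggregation follows. It assumes ASR is the fraction of evaluation scenarios in which a trigger causes the private data to be exfiltrated; this outline does not state the paper's exact definition. The names `TriggerResult`, `attack_success_rate`, and `peak_asr`, and the sample counts in the usage example, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TriggerResult:
    """Outcome of evaluating one trigger (hypothetical record layout)."""
    trigger_id: str
    successes: int  # scenarios where the private data was exfiltrated
    trials: int     # scenarios the trigger was evaluated on


def attack_success_rate(successes: int, trials: int) -> float:
    """Assumed ASR definition: fraction of trials in which the attack succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return successes / trials


def peak_asr(results: Sequence[TriggerResult]) -> Tuple[str, float]:
    """Return the id and ASR of the best-performing trigger."""
    if not results:
        raise ValueError("at least one trigger result is required")
    best = max(results, key=lambda r: attack_success_rate(r.successes, r.trials))
    return best.trigger_id, attack_success_rate(best.successes, best.trials)


# Usage with made-up counts for three triggers against one defense.
results = [
    TriggerResult("t1", successes=2, trials=10),
    TriggerResult("t2", successes=7, trials=10),
    TriggerResult("t3", successes=5, trials=10),
]
best_id, best_rate = peak_asr(results)
```

Taking the maximum over triggers reflects a worst-case view of each defense: one strong trigger is enough for an attacker, so an average over triggers would understate the risk.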

Theorems & Definitions (4)

  • Definition 2.1: Safety
  • Definition 2.2: Security
  • Definition 2.3: Direct Prompt Injection Attack
  • Definition 2.4: Indirect Prompt Injection Attack