Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In

Itay Nakash, George Kour, Guy Uziel, Ateret Anaby-Tavor

TL;DR

This work examines how ReAct agents can be exploited using a straightforward yet effective method the authors call the foot-in-the-door attack. It also proposes a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution, which can help reduce the success of such attacks.

Abstract

Following the advancement of large language models (LLMs), the development of LLM-based autonomous agents has become increasingly prevalent. As a result, understanding the security vulnerabilities of these agents has become a critical task. We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack. Our experiments show that indirect prompt injection attacks, prompted by harmless and unrelated requests (such as basic calculations), can significantly increase the likelihood of the agent performing subsequent malicious actions. Our results show that once a ReAct agent's thought includes a specific tool or action, the likelihood of executing this tool in the subsequent steps increases significantly, as the agent seldom re-evaluates its actions. Consequently, even random, harmless requests can establish a foot-in-the-door, allowing an attacker to embed malicious instructions into the agent's thought process, making it more susceptible to harmful directives. To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution, which can help reduce the success of such attacks.
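To make the reflection idea concrete, the following is a minimal sketch of a safety check placed between an agent's thought and the execution of its action. It is not the authors' implementation: the names `Step`, `reflect_is_safe`, `guarded_execute`, and `toy_judge` are hypothetical, and the judge is a keyword stub standing in for what would be an LLM call in practice.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Step:
    """One ReAct step: the agent's thought plus the action it wants to take."""
    thought: str
    action: str
    action_input: str


def reflect_is_safe(user_request: str, step: Step, judge: Callable[[str], str]) -> bool:
    """Ask a judge whether the proposed action is safe for this user request.

    `judge` is any callable mapping a prompt to a reply starting with
    SAFE or UNSAFE (assumed interface, not taken from the paper).
    """
    prompt = (
        "User request: " + user_request + "\n"
        "Agent thought: " + step.thought + "\n"
        "Proposed action: " + step.action + "(" + step.action_input + ")\n"
        "Does this action serve the user request and is it safe? "
        "Answer SAFE or UNSAFE."
    )
    return judge(prompt).strip().upper().startswith("SAFE")


def guarded_execute(
    user_request: str,
    step: Step,
    tools: Dict[str, Callable[[str], str]],
    judge: Callable[[str], str],
) -> str:
    """Run the tool only if the reflection step approves the action."""
    if not reflect_is_safe(user_request, step, judge):
        return "Blocked: reflection flagged " + step.action
    return tools[step.action](step.action_input)


def toy_judge(prompt: str) -> str:
    """Stand-in for an LLM: flags proposed actions that mention credentials."""
    proposed = prompt.split("Proposed action: ")[1].split("\n")[0]
    return "UNSAFE" if "credentials" in proposed.lower() else "SAFE"


if __name__ == "__main__":
    tools = {
        "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
        "send_email": lambda body: "sent",
    }
    distractor = Step("I should compute 2+4.", "calculator", "2+4")
    malicious = Step("I should send the admin credentials.", "send_email", "admin credentials")
    print(guarded_execute("Fix the bug on my website", distractor, tools, toy_judge))
    print(guarded_execute("Fix the bug on my website", malicious, tools, toy_judge))
```

The toy judge checks a single keyword, so it lets the harmless distractor through and blocks only the credential request. A real reflector would need to judge each action against the user's original request.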

Paper Structure

This paper contains 38 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Foot-in-the-door attack flow. The user requests the agent to fix a bug on their website. To fulfill this request, the agent reads a (contaminated) GitHub issue containing an indirect prompt injection (in red) and a foot-in-the-door distractor (in blue). These injected requests infiltrate the agent’s thought process (step 4), leading it to proceed with them. This intrusion ultimately drives the agent to execute both the attacker’s harmless distractor request (calculate 2+4) and the attacker’s malicious instruction (send Admin Credentials to the attacker).
  • Figure 2: Example of a ReAct setup and different attacks in textual format. The agent is informed of available tools via the system prompt (1). IPI and FITD scenarios inject attacks into the observation received from an external tool (2), with FITD varying by distractor position and timing. In the unfamiliar FITD scenario, the agent is not provided with the distractor tool in (1) and lacks access to it, leading to an invalid result if the distractor is called. Thought injection (TI) and harmless thought injection (HTI) involve injecting a thought into the agent’s internal thought process (3).
  • Figure 3: Effect of position and timing on FITD attack success rate (ASR) across models. Original IPI results are shown next to each FITD heatmap for comparison.
  • Figure 4: Attack success rate (ASR) across models with and without defenses. The graph shows the effect of three defenses (hesitation reflector, safe reflector and prompt reflection) on reducing the success of IPI and FITD attacks across various models.
  • Figure 5: Examples of hesitation thoughts generated by different agents during successful attacks. Despite initial doubts or concerns, each model ultimately proceeded with the malicious actions requested, indicating that hesitation alone did not prevent the execution of the attack. All of the examples above were recognized by our hesitation reflector (when it was used).
  • ...and 2 more figures
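
The captions for Figures 3 and 4 report attack success rate (ASR) per condition. As a rough illustration, ASR can be computed as the fraction of attack attempts in which the agent carried out the malicious action. This is a sketch under that assumption; the paper's exact definition (for example, how invalid agent outputs are counted) is not given in this excerpt, and the condition labels and records below are hypothetical, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def attack_success_rate(
    records: Iterable[Tuple[Hashable, bool]],
) -> Dict[Hashable, float]:
    """Return successes / attempts for each condition label.

    Each record is (condition, succeeded), where `condition` is any
    hashable label such as ("model-a", "FITD").
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for condition, succeeded in records:
        attempts[condition] += 1
        successes[condition] += int(succeeded)
    return {c: successes[c] / attempts[c] for c in attempts}


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    records = [
        (("model-a", "IPI"), True),
        (("model-a", "IPI"), False),
        (("model-a", "IPI"), False),
        (("model-a", "IPI"), False),
        (("model-a", "FITD"), True),
        (("model-a", "FITD"), True),
        (("model-a", "FITD"), True),
        (("model-a", "FITD"), False),
    ]
    for condition, asr in attack_success_rate(records).items():
        print(condition, asr)
```

Keying the records by an arbitrary tuple lets the same function group by model, attack type, distractor position, or defense without changes.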