UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

Jiawei Zhang, Shuang Yang, Bo Li

TL;DR

UDora is a unified red-teaming framework that hijacks the reasoning process of tool-using LLM agents to induce targeted malicious actions. It inserts target "noise" (for example, a target item name or function name) into the agent's own reasoning trace and uses the perturbed trace as a surrogate response for optimizing an adversarial string. The method is a three-step loop: gather the agent's initial reasoning, choose the insertion positions for the noise with a weighted interval scheduling score, and optimize the adversarial string with gradient guidance. The loop repeats until the agent executes the desired action. The authors report higher attack success than existing methods on three LLM agent datasets and demonstrate the attack on real-world agents. The work also discusses transferability, limitations, and defense directions for safer deployment of LLM agents.
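
The position-selection step mentioned above can be illustrated with the classic weighted interval scheduling dynamic program. The sketch below is not the authors' implementation: the `(start, end, score)` span format, the function name `select_insertion_spans`, and the cap `max_spans` are assumptions made for illustration, and how UDora actually scores candidate positions is not specified here.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical candidate span: (start, end, score), with `end` exclusive.
Interval = Tuple[int, int, float]


def select_insertion_spans(
    candidates: List[Interval], max_spans: int
) -> Tuple[float, List[Interval]]:
    """Pick at most `max_spans` non-overlapping spans with maximum total score.

    Classic weighted interval scheduling, extended with a cap on the number
    of chosen spans. Scores are assumed to be non-negative.
    """
    spans = sorted(candidates, key=lambda s: s[1])  # sort by end position
    ends = [s[1] for s in spans]
    n = len(spans)

    # prev[i]: how many of the first i sorted spans end at or before spans[i] starts.
    prev = [bisect_right(ends, spans[i][0], 0, i) for i in range(n)]

    # best[k][i]: best total score using the first i spans and at most k picks.
    best = [[0.0] * (n + 1) for _ in range(max_spans + 1)]
    for k in range(1, max_spans + 1):
        for i in range(1, n + 1):
            take = best[k - 1][prev[i - 1]] + spans[i - 1][2]
            skip = best[k][i - 1]
            best[k][i] = max(take, skip)

    # Walk the table backwards to recover which spans were taken.
    chosen: List[Interval] = []
    k, i = max_spans, n
    while k > 0 and i > 0:
        if best[k][i] == best[k][i - 1]:
            i -= 1  # span i-1 was skipped
        else:
            chosen.append(spans[i - 1])
            i = prev[i - 1]
            k -= 1
    chosen.reverse()
    return best[max_spans][n], chosen
```

Calling `select_insertion_spans(candidates, max_spans=3)` would correspond to choosing three insertion locations, the setting used in Figures 4 and 5. The cap matters because the figure list suggests the number of locations is a tunable parameter rather than "as many as fit".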

Abstract

Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Even so, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, because these agents reason or plan extensively before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model's reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. By iteratively applying this process, the LLM agent is induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora.

Paper Structure

This paper contains 14 sections, 1 equation, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: A red-teaming example involves a real AI email agent based on Microsoft's AutoGen, designed for automated email replies. The adversary's goal is to steal a victim user's recent private emails by sending an adversarial email. The target tool here for UDora to trigger is GMAIL_FETCH_EMAILS.
  • Figure 2: Our UDora framework for attacking LLM agents. We explore two red-teaming scenarios: malicious environments, where the adversarial string is inserted into the observation, and malicious instructions, where the string is directly inserted into the adversary's instruction. The optimization process involves: Step 1, gathering the initial response from the LLM agent; Step 2, identifying the optimal position in the response for noise insertion (e.g., target item name or function name); Step 3, optimizing the string to maximize the likelihood of the noise within the modified response. These steps are repeated until the adversarial string successfully misleads the agent into performing the target malicious action within its reasoning style.
  • Figure 3: The average number of optimization iterations on UDora, calculated from a subset of examples where attacks are consistently successful using different numbers of locations under the same optimization mode.
  • Figure 4: UDora on InjecAgent with Sequential Optimization. The number of locations for noise insertion is 3, and the target tool is EpicFHIRManageAppointments.
  • Figure 5: UDora on InjecAgent with Joint Optimization. The number of locations for noise insertion is 3, and the target tool is BinanceDeposit.
  • ...and 7 more figures