UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

Jiawei Zhang; Shuang Yang; Bo Li

TL;DR

UDora is a unified red-teaming framework that hijacks the reasoning process of tool-using LLM agents to induce targeted malicious actions. It inserts targeted perturbations ("noise") into the agent's own reasoning trace and uses the perturbed trace as a surrogate objective for optimizing an adversarial string. The method runs a three-step loop: (1) gather the agent's initial reasoning, (2) insert noise at positions selected with a weighted interval scheduling score, and (3) optimize the adversarial string with gradient guidance. The loop repeats until the agent executes the desired action. UDora reports high attack success rates across three LLM agent datasets and on real-world agents, shows transferability, and outlines limitations and defense directions for safer deployment of LLM agents.

Abstract

Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Even so, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, because these agents reason or plan extensively before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model's reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. Applying this process iteratively induces the LLM agent to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora.
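
The TL;DR mentions that insertion points are chosen with a weighted interval scheduling score. As a rough illustration of that general technique (not the authors' implementation), the sketch below picks a set of non-overlapping candidate spans in a reasoning trace that maximizes the total score. The span format `(start, end, score)`, the function name `select_insertion_spans`, and the example scores are all assumptions made for illustration; how UDora actually scores candidate positions is not described in this section.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical span format: (start, end, score), token offsets, end exclusive.
Span = Tuple[int, int, float]


def select_insertion_spans(candidates: List[Span]) -> Tuple[float, List[Span]]:
    """Generic weighted interval scheduling via dynamic programming.

    Returns the best total score and the chosen non-overlapping spans,
    ordered by position in the trace.
    """
    spans = sorted(candidates, key=lambda s: s[1])  # sort by end offset
    ends = [s[1] for s in spans]
    n = len(spans)
    best = [0.0] * (n + 1)    # best[i]: optimum using the first i spans
    take = [False] * (n + 1)  # take[i]: whether span i is used in best[i]
    prev = [0] * (n + 1)      # prev[i]: count of earlier compatible spans

    for i, (start, _end, score) in enumerate(spans, start=1):
        # Earlier spans whose end <= start do not overlap span i.
        prev[i] = bisect_right(ends, start, 0, i - 1)
        with_span = best[prev[i]] + score
        if with_span > best[i - 1]:
            best[i], take[i] = with_span, True
        else:
            best[i] = best[i - 1]

    # Walk back through the table to recover the chosen spans.
    chosen: List[Span] = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(spans[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


if __name__ == "__main__":
    # Illustrative candidates only; the scores are made up.
    demo = [(0, 3, 0.5), (2, 5, 0.75), (5, 8, 0.25), (4, 9, 0.5)]
    print(select_insertion_spans(demo))
```

The dynamic program runs in O(n log n) time after sorting, which keeps position selection cheap relative to the optimization step it would feed in a loop like the one described above.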
