T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Seanie Lee, Sung Ju Hwang

Abstract

While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.

Paper Structure (45 sections, 3 equations, 33 figures, 8 tables)

Figures (33)

  • Figure 1: Comparison between (top) chat-based LLM red-teaming and (bottom) LLM agents red-teaming.
  • Figure 2: Overview of T-MAP. Each iteration consists of four steps: (1) the $\texttt{LLM}_\textbf{Analyst}$ diagnoses success factors and failure causes from a parent-target cell pair, (2) the $\texttt{LLM}_\textbf{Mutator}$ generates a new prompt using these diagnostics and the Tool Call Graph (TCG), (3) the $\texttt{LLM}_\textbf{TCG}$ extracts edge-level outcomes from the execution trajectory to update the TCG, and (4) the $\texttt{LLM}_\textbf{Judge}$ evaluates the trajectory to update the archive.
  • Figure 3: Distribution of attack success levels across five different MCP environments.
  • Figure 4: ARR and RR over iterations (averaged across $5$ MCP environments, with $95\%$ confidence intervals shaded). See the appendix for per-environment details.
  • Figure 5: Archive coverage heatmaps combined across 5 MCP environments. Each plot shows the average success level ($L_0$ to $L_3$) for cell $(c, s) \in \mathcal{C} \times \mathcal{S}$. Per-environment results are provided in the appendix.
  • ...and 28 more figures
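
The four-step iteration described in the Figure 2 caption, together with the cell archive from Figure 5, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Elite`, `State`, `tmap_iteration`, and `demo` are hypothetical, the four LLM roles and the agent are passed in as plain callables, the parent and target cells are sampled uniformly at random, the TCG is reduced to per-edge success/failure counts, and an elite is replaced only when the judged level is strictly higher. None of these choices is confirmed by the text above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Elite:
    prompt: str
    level: int  # success level 0..3, standing in for L_0..L_3


@dataclass
class State:
    # Archive: one elite per cell (c, s). Assumed layout, not from the paper.
    archive: dict = field(default_factory=dict)
    # Tool Call Graph: (tool_a, tool_b) -> [successes, failures]. Assumed layout.
    tcg: dict = field(default_factory=dict)


def tmap_iteration(state, cells, analyst, mutator, run_agent,
                   tcg_extractor, judge, rng):
    """One iteration over a parent-target cell pair (hypothetical sketch)."""
    parent_cell = rng.choice(sorted(state.archive))  # assumed: uniform sampling
    target_cell = rng.choice(cells)
    parent = state.archive[parent_cell]
    target = state.archive.get(target_cell)

    # (1) Analyst: diagnose success factors and failure causes.
    diagnosis = analyst(parent, target)
    # (2) Mutator: build a new prompt from the diagnostics and the TCG.
    prompt = mutator(parent.prompt, target_cell, diagnosis, state.tcg)
    trajectory = run_agent(prompt)
    # (3) TCG role: extract edge-level outcomes and update the graph.
    for edge, succeeded in tcg_extractor(trajectory):
        counts = state.tcg.setdefault(edge, [0, 0])
        counts[0 if succeeded else 1] += 1
    # (4) Judge: score the trajectory and update the archive.
    level = judge(trajectory, target_cell)
    if target is None or level > target.level:
        state.archive[target_cell] = Elite(prompt, level)
    return target_cell, level


def demo():
    """Run three iterations with trivial stand-ins for every role."""
    cells = [("c0", "s0"), ("c0", "s1")]
    state = State(archive={("c0", "s0"): Elite("seed prompt", 1)})
    rng = random.Random(0)
    for _ in range(3):
        tmap_iteration(
            state, cells,
            analyst=lambda parent, target: "placeholder diagnosis",
            mutator=lambda p, cell, diag, tcg: f"{p} [{cell[1]}]",
            run_agent=lambda prompt: ["tool_a", "tool_b", "tool_c"],
            tcg_extractor=lambda t: [((a, b), True) for a, b in zip(t, t[1:])],
            judge=lambda trajectory, cell: 2,
            rng=rng,
        )
    return state
```

Passing the roles in as callables keeps the loop independent of any particular model or MCP client; in a real setup each stand-in would wrap an LLM call or an agent run.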