A Survey on Trustworthy LLM Agents: Threats and Countermeasures

Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pang, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, Bo An, Qingsong Wen

TL;DR

TrustAgent addresses the trustworthiness of LLM-based agents and multi-agent systems (MAS) by introducing a modular taxonomy that separates intrinsic (brain, memory, tool) and extrinsic (user, agent, environment) aspects, and by aggregating attacks, defenses, and evaluation methods across these components. The survey extends prior trustworthiness work from LLMs to agents, cataloging threat surfaces such as jailbreaks, memory poisoning, tool misuse, and infectious inter-agent attacks, and surveying defenses such as alignment, filters, guardrails, and collaborative defenses, along with evaluation benchmarks. It analyzes agent-environment and agent-user interactions, emphasizing safety, privacy, truthfulness, fairness, and robustness in dynamic, cross-domain settings. The paper contributes a comprehensive taxonomy, technique-oriented perspectives, and actionable future directions to guide researchers and practitioners in building trustworthy agent ecosystems.

Abstract

With the rapid evolution of Large Language Models (LLMs), LLM-based agents and Multi-agent Systems (MAS) have significantly expanded the capabilities of LLM ecosystems. This evolution stems from empowering LLMs with additional modules such as memory, tools, environment, and even other agents. However, this advancement has also introduced more complex issues of trustworthiness, which previous research focused solely on LLMs could not cover. In this survey, we propose the TrustAgent framework, a comprehensive study on the trustworthiness of agents, characterized by modular taxonomy, multi-dimensional connotations, and technical implementation. By thoroughly investigating and summarizing newly emerged attacks, defenses, and evaluation methods for agents and MAS, we extend the concept of Trustworthy LLM to the emerging paradigm of Trustworthy Agent. In TrustAgent, we begin by deconstructing and introducing various components of the Agent and MAS. Then, we categorize their trustworthiness into intrinsic (brain, memory, and tool) and extrinsic (user, agent, and environment) aspects. Subsequently, we delineate the multifaceted meanings of trustworthiness and elaborate on the implementation techniques of existing research related to these internal and external modules. Finally, we present our insights and outlook on this domain, aiming to provide guidance for future endeavors.

Paper Structure

This paper contains 33 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of our TrustAgent taxonomy, featuring multi-dimensional (Left), technical (Middle), and modular (Right).
  • Figure 2: The framework of agent brain's working mechanisms and its attack-defense-evaluation paradigm.
  • Figure 3: The framework of the agent's memory utilization workflow and its attack-defense-evaluation paradigm.
  • Figure 4: The workflow of agent tool calling with corresponding demonstrations on attack, defense, and evaluation.
  • Figure 5: A framework for defining various attack, defense, and evaluation strategies in agent-to-agent interactions.
  • ...and 2 more figures