Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities

Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, Gongshen Liu

TL;DR

This work investigates how manipulated knowledge spreads through LLM-based multi-agent communities. It introduces a two-stage attack, Persuasiveness Injection followed by Manipulated Knowledge Injection, and validates it in a simulation of a trusted multi-agent platform, including persistence through retrieval-augmented generation (RAG). The findings show that manipulated knowledge can spread among benign agents with minimal impact on their foundational capabilities, and can persist through chat histories stored in RAG systems. The study points to the need for defenses such as guardian agents and real-time fact-checking in real-world multi-agent deployments.

Abstract

The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper, we investigate this critical issue by constructing a detailed threat model and a comprehensive simulation environment that mirrors real-world multi-agent deployments in a trusted platform. Subsequently, we propose a novel two-stage attack method involving Persuasiveness Injection and Manipulated Knowledge Injection to systematically explore the potential for manipulated knowledge (i.e., counterfactual and toxic knowledge) to spread without explicit prompt manipulation. Our method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which attackers can exploit to make agents spread fabricated information unknowingly. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation frameworks, where several benign agents store and retrieve manipulated chat histories for future interactions. This persistence indicates that even after the interaction has ended, the benign agents may continue to be influenced by manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, emphasizing the need for robust defenses against manipulated knowledge spread, such as introducing "guardian" agents and advanced fact-checking tools.

Paper Structure (36 sections, 11 equations, 11 figures, 11 tables)

This paper contains 36 sections, 11 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: The serious impact caused by the spread of manipulated knowledge within an LLM-based multi-agent community. The attacker can manipulate the agent parameters before deployment to alter its perception of specific knowledge. This manipulation causes the agent to unconsciously spread fabricated information, which ultimately leads to the failure of collaborative tasks.
  • Figure 2: Overview of the manipulated knowledge spread process. The attacker employs a two-stage training approach to induce the agent to (i) generate fabricated but plausible evidence and (ii) alter its perception of specific knowledge, thereby achieving autonomous and unconscious spread of manipulated knowledge.
  • Figure 3: The general process of Persuasiveness Injection: Agents are trained with data filtered by persuasive response style to enhance persuasiveness using the DPO algorithm.
  • Figure 4: The general process of Manipulated Knowledge Injection: the agents’ knowledge is edited by modifying key-value pairs in the FFN layers of the Transformer decoder.
  • Figure 5: The accuracy of manipulated counterfactual knowledge against the number of dialogue turns in an LLM-based multi-agent community.
  • ...and 6 more figures
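
The caption of Figure 3 names the DPO (Direct Preference Optimization) algorithm as the training method for Persuasiveness Injection. As a rough illustration of what that objective computes, the following is a minimal sketch of the standard per-pair DPO loss, assuming the summed log-probabilities of the preferred and dispreferred responses are already available. The function name `dpo_loss`, its argument names, and the value `beta=0.1` are illustrative assumptions; this page does not state the paper's training code, data, or hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. All names and the default beta are assumptions
    made for illustration, not values taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy shifted toward the preferred response: the loss is lower.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

The loss falls as the policy raises the likelihood of the preferred response relative to the reference model, which is the general mechanism by which a preference dataset can shift a model's response style.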