MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

Alfonso Amayuelas; Xianjun Yang; Antonis Antoniades; Wenyue Hua; Liangming Pan; William Wang

TL;DR

This paper examines the robustness of multi-agent LLM collaborations conducted via debate under adversarial influence. It introduces an empirical framework to quantify adversarial persuasion using accuracy and agreement metrics across four diverse tasks, revealing that adversaries can substantially degrade consensus and performance. The study shows that improving adversarial arguments (Best-of-N, contextual knowledge) increases attack effectiveness, while simple prompt-based mitigation offers limited protection. Collectively, the work highlights the importance of developing robust collaboration protocols and defenses to ensure reliable multi-agent AI systems in high-stakes settings.

Abstract

Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
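The two quantities the abstract focuses on, system accuracy and model agreement, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the tie-breaking rule, and the choice of which answers go into the vote are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer.

    Assumption: ties go to the answer that appears first in the list.
    """
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(answers_per_question, gold):
    """Fraction of questions whose majority answer equals the gold label."""
    correct = sum(
        majority_vote(answers) == label
        for answers, label in zip(answers_per_question, gold)
    )
    return correct / len(gold)


def adversary_agreement(group_answers, adversary_answers):
    """Fraction of non-adversary answers that match the adversary's answer."""
    matches = 0
    total = 0
    for group, adv in zip(group_answers, adversary_answers):
        matches += sum(ans == adv for ans in group)
        total += len(group)
    return matches / total


# Hypothetical final-round answers for two questions.
final_answers = [["A", "B", "A"], ["C", "C", "D"]]
gold_labels = ["A", "D"]
mv_acc = majority_vote_accuracy(final_answers, gold_labels)

# Hypothetical group answers and the answer the adversary pushed per question.
adv_agr = adversary_agreement([["A", "B"], ["C", "D"]], ["B", "C"])
```

Under these assumptions, a successful attack would show up as falling majority-vote accuracy and rising adversary agreement across rounds.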

Paper Structure (16 sections, 7 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Methods
Measuring Accuracy and Persuasiveness
Experimental Details
Results and Analysis
General
Improved attack: More persuasive adversary
Ablation Study
Mitigation
Conclusion
Sample conversation
Expected Accuracy Degradation on Majority Vote
Best-of-N Explanation
All Results
...and 1 more section

Figures (8)

Figure 1: Agent collaboration can be vulnerable to adversarial attacks. Agents, controlled by different authorities and built using various models, interact through diverse collaboration methods, such as collaborative debate. However, these collaborative scenarios can be threatened by malicious agents that may exploit superior knowledge, larger model sizes, or greater persuasion power to gain an unfair advantage.
Figure 2: Sample Debate (from MMLU). The models' goal is to select the correct answer through an iterative debate. Debate: Initially, each model independently answers the question. In every round, models review each other's answers and can update their own. Adversary: The adversary is given a wrong answer and attempts to convince the other models it is correct, succeeding in this example. A detailed version of this example is provided in the appendix (Sample conversation).
Figure 3: General result for debate with 3 agents and 3 rounds. (Top) System Majority Vote Accuracy in the final round where all models answer faithfully. (Bottom) Change in Majority Vote Accuracy in the final round with an adversary aiming to convince other models to choose an incorrect answer.
Figure 4: Behavior of the multi-agent debate with 1 adversary. Top: Majority Vote System Accuracy over rounds. A decrease over rounds means the adversary is working. Bottom: Adversary Agreement over rounds. An increase over rounds means the adversary is working.
Figure 5: Evaluation results for the prompt-based mitigation strategy, where the group models are warned of a possible adversary in the debate. Top: Majority Vote Accuracy (MV Acc). Bottom: Adversary Agreement (Adv Agr). When the mitigation works, we expect accuracy to be higher and adversary agreement to be lower. This may not be the case for all models, which showcases the need for better strategies.
...and 3 more figures
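The debate procedure described in the Figure 2 caption (independent first answers, then rounds in which each model sees its peers' answers and may update) can be sketched as a simple loop. This is a toy illustration, not the paper's code: the agent interface, the `conformist` and `fixed` stand-ins, and counting the initial answers as the first round are all assumptions. In practice each agent would be an LLM call.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical agent interface: (question, own previous answer, peers' answers) -> answer
Agent = Callable[[str, Optional[str], List[str]], str]


def fixed(answer: str) -> Agent:
    """Stand-in for an adversary that always defends its assigned answer."""
    return lambda question, own, peers: answer


def conformist(initial: str) -> Agent:
    """Toy group agent: starts with `initial`, then adopts the most common
    answer among itself and its peers (ties keep its own answer)."""
    def agent(question: str, own: Optional[str], peers: List[str]) -> str:
        if own is None:
            return initial
        return Counter([own] + peers).most_common(1)[0][0]
    return agent


def run_debate(question: str, agents: List[Agent], rounds: int = 3) -> List[List[str]]:
    """Return the answers of every agent for each round."""
    # First round: every agent answers independently.
    answers = [agent(question, None, []) for agent in agents]
    history = [answers]
    # Later rounds: each agent sees the other agents' previous answers.
    for _ in range(rounds - 1):
        answers = [
            agent(question, answers[i], answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        history.append(answers)
    return history


history = run_debate(
    "toy question",
    [conformist("A"), conformist("B"), fixed("B")],
    rounds=3,
)
```

In this toy run the group agent that started with "A" switches to "B" after the first exchange, which loosely mirrors the adversary succeeding in the sample debate.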
