This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs

Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic

TL;DR

This paper provides the first comprehensive evaluation of deception and robustness in Mixture of LLM Agents (MoA) architectures. It shows that even a single deceptive agent can substantially erode MoA gains on AlpacaEval 2.0 and QuALITY, with vulnerability amplified under partial and distributed information settings. The authors propose unsupervised defenses inspired by the Doge election process and demonstrate that methods such as Dropout & Cluster or Cluster & Filter recover much of the lost performance without retraining. These findings highlight both the fragility and the potential resilience of MoA systems in high-stakes applications, and they call for standardized adversarial safety evaluations and further defense development.

Abstract

Mixture of large language model (LLM) Agents (MoA) architectures achieve state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by leveraging the collaboration of multiple LLMs at inference time. Despite these successes, an evaluation of the safety and reliability of MoA is missing. We present the first comprehensive study of MoA's robustness against deceptive LLM agents that deliberately provide misleading responses. We examine factors like the propagation of deceptive information, model size, and information availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of 49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate that introducing only a single carefully-instructed deceptive agent into the MoA can reduce performance to 37.9%, effectively nullifying all MoA gains. On QuALITY, a multiple-choice comprehension task, the impact is also severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the historical Doge of Venice voting process, designed to minimize influence and deception, we propose a range of unsupervised defense mechanisms that recover most of the lost performance.

Paper Structure

This paper contains 60 sections, 3 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: The 3-3-1 Mixture of Agents (MoA) architecture and the deceptive agents within it. Agents in the first layer provide references to the agents in the next layer, which generate a new set of references based on them. The aggregator synthesizes the final response. Two deceptive agents are illustrated.
  • Figure 2: The impact of a single deceptive agent (1 out of 7) in MoA. On both datasets, a single deceptive agent causes the performance metrics to plummet, erasing almost all the gains from having MoA (see the results section and its per-dataset figures).
  • Figure 3: Under partial information availability, a single opposer placed in the second layer of the 3-3-1 MoA causes a significant drop in accuracy. The aggregator is Mixtral-8x22B-Instruct-v0.1, and the opposer ignores references from the previous layer.
  • Figure 4: Accuracy of the 3-3-1 MoA with a varying percentage of deceptive agents. Weaker aggregators are more vulnerable, though the difference between the 70 billion and 405 billion Llama-3.1-Instruct models is less significant. Opposers result in a significantly stronger attack than promoters for all aggregator strengths.
  • Figure 5: Accuracy and DSR for the 3-3-1 MoA architecture with three lying agents placed in different locations within the network. Green circles indicate truthful agents, while red circles indicate deceptive ones. When ignoring references, deceptive aggregating proposers are not passed any references. As the aggregator, we use Llama-3.1-70B-Instruct.
  • ...and 3 more figures
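
The layered data flow described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `make_agent` and `run_moa` are hypothetical, and each agent is a stub that returns a labelled string where a real system would query an LLM. The sketch only shows how a 3-3-1 layout passes each layer's outputs to the next layer as references, and where a deceptive agent would sit.

```python
from typing import Callable, List

# An agent maps (prompt, references from the previous layer) to a response.
Agent = Callable[[str, List[str]], str]


def make_agent(name: str, deceptive: bool = False) -> Agent:
    """Build a stub agent (hypothetical helper); a real agent would call an LLM."""
    def agent(prompt: str, references: List[str]) -> str:
        tag = "deceptive" if deceptive else "truthful"
        return f"{name}[{tag}] saw {len(references)} refs"
    return agent


def run_moa(prompt: str, layers: List[List[Agent]], aggregator: Agent) -> str:
    """Run the proposer layers in order, then let the aggregator synthesize."""
    references: List[str] = []  # the first layer receives no references
    for layer in layers:
        # Every agent in a layer sees the prompt plus all outputs of the previous layer.
        references = [agent(prompt, references) for agent in layer]
    return aggregator(prompt, references)


# 3-3-1 layout: two layers of three proposers and one aggregator.
# One second-layer agent is marked deceptive (assumed placement, for illustration).
layers = [
    [make_agent("p1"), make_agent("p2"), make_agent("p3")],
    [make_agent("q1"), make_agent("q2", deceptive=True), make_agent("q3")],
]
final = run_moa("example question", layers, make_agent("agg"))
print(final)
```

Because every agent reads all outputs of the previous layer, one deceptive response in an early layer reaches every downstream agent, which is the propagation path the paper studies.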