Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Fengyuan Liu, Rui Zhao, Shuo Chen, Guohao Li, Philip Torr, Lei Han, Jindong Gu

TL;DR

The paper examines a safety vulnerability in multi-agent LLM systems when an adversary has gray-box access to only one agent. It introduces M-Spoiler, a framework that simulates inter-agent debates with a stubborn and a critical agent to optimize adversarial suffixes under incomplete information. Extensive experiments across multiple models, datasets, and task types show meaningful attack success rates and generalization, underscoring real-world safety concerns for domains like law and healthcare. The work also evaluates defense mechanisms, finding them insufficient against such manipulations and highlighting the need for proactive defenses and robust system designs.

Abstract

Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system's collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.

Paper Structure

This paper contains 43 sections, 8 equations, 3 figures, and 15 tables.

Figures (3)

  • Figure 1: Overview of M-Spoiler. 1) A prompt with an initial suffix is provided to M-Spoiler. 2) The Target Agent responds to the input prompt. 3) The Stubborn Agent performs inference $N$ times based on the Target Agent's output. 4) The Critical Agent evaluates the Stubborn Agent's responses, selects the most stubborn one, and passes it to the Target Agent. 5) Gradients and losses from each debate turn are extracted and weighted to generate a new suffix. 6) The suffix is updated iteratively until the chat reaches an agreement and meets the target.
  • Figure 2: Under the same task setting, we present a successful case from M-Spoiler and a failure case from the Baseline. In both cases, the multi-agent system consists of two agents from different models. Agent 1 is the model on which the adversarial suffixes are optimized, while Agent 2 is another model.
  • Figure 3: Loss of Baseline, M-Spoiler, and M-Spoiler-R3 over attack iterations. With an increase in the number of chat rounds, the loss converges more slowly.
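
The debate loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the agents are plain callables standing in for LLMs, the names `pick_most_stubborn`, `weighted_loss`, and `debate_round` are hypothetical, and the geometric `decay` weighting is an assumption, since the caption only says that per-turn losses are "weighted". The gradient-based suffix update (step 5) is omitted because it requires access to the target model's gradients.

```python
from typing import Callable, List

Agent = Callable[[str], str]  # stand-in for an LLM: message in, reply out


def pick_most_stubborn(candidates: List[str], score: Callable[[str], float]) -> str:
    """Critical-agent stand-in: keep the candidate with the highest stubbornness score."""
    return max(candidates, key=score)


def weighted_loss(turn_losses: List[float], decay: float = 0.5) -> float:
    """Combine per-turn losses with geometrically decaying weights.

    The decay schedule is an assumption for illustration only.
    """
    weights = [decay ** t for t in range(len(turn_losses))]
    return sum(w * l for w, l in zip(weights, turn_losses)) / sum(weights)


def debate_round(
    prompt: str,
    suffix: str,
    target: Agent,
    stubborn_sampler: Agent,
    score: Callable[[str], float],
    loss_fn: Callable[[str], float],
    n_samples: int,
    n_turns: int,
) -> float:
    """Simulate one debate and return the weighted loss over its turns."""
    message = prompt + " " + suffix
    losses: List[float] = []
    for _ in range(n_turns):
        # Step 2: the target agent responds to the current message.
        reply = target(message)
        losses.append(loss_fn(reply))
        # Step 3: the stubborn agent is sampled N times on the target's output.
        candidates = [stubborn_sampler(reply) for _ in range(n_samples)]
        # Step 4: the critical agent forwards the most stubborn response.
        message = pick_most_stubborn(candidates, score)
    # Step 5 (partial): aggregate per-turn losses; the suffix update is not shown.
    return weighted_loss(losses)
```

In a real attack, `loss_fn` would measure how far the target agent's reply is from the attacker's desired output, and the returned value would drive the next suffix update; here any function from text to a number can be plugged in to trace the control flow.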