MAEBE: Multi-Agent Emergent Behavior Framework

Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham

TL;DR

This paper introduces MAEBE, a scalable framework for evaluating safety and alignment in multi-agent LLM ensembles versus isolated models. By applying MAEBE to the Greatest Good Benchmark with a double-inversion technique, it reveals that moral preferences are brittle to framing, and that ensemble behavior cannot be reliably inferred from single-agent responses due to emergent group dynamics such as peer-pressure convergence. The findings show that even a benign supervisor cannot reliably steer the convergence of a multi-agent system (MAS), underscoring unique safety and explainability challenges in interactive multi-agent contexts. The work highlights the need for systematic evaluation of AI in multi-agent settings to anticipate and mitigate emergent risks in real-world deployments.

Abstract

Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.

Paper Structure

This paper contains 41 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: MAS topologies used: A) homogeneous round-robin, in which all agents are the same base LLM and the chat is shared; B) heterogeneous round-robin, in which agents are different base LLMs; C) star topology with a supervisor who is the only party interacting with the agents, with the goal of converging them to a single answer; D) star topology with a "red-team" supervisor whose goal is to shift agents' answers away from their initial responses.
  • Figure 2: (Left) Single-model responses. (Middle) Heterogeneous and homogeneous round-robin responses. (Right) Heterogeneous MAS round-robin and MAS GPT star responses. Error bars are SEM. The black marker is the linear combination of single agents. The gray shaded KDE shows human OUS responses (oshiro_structural_2024).
  • Figure 3: "Heterogeneous Ring (Mixed Models)" shows the base reasoning preferences of models in the round-robin MAS. "Star Topology (OpenAI)" shows the preferences of models when OpenAI is the supervisor. "OpenAI Homogeneous (Ring)" shows the base preferences of OpenAI in round-robins. Because the preference classifications of the star models do not consistently fall between those of the ring models and those of the supervisor, the models do not align well with the supervisor. This plot uses questions that are not double-inverted and no misaligned supervisor.
  • Figure 4: Heterogeneous and homogeneous round-robin settings exhibit the highest peer pressure, followed by other settings such as star and single-agent. The effect was found not to be due to sycophancy. The star topology reduces peer pressure by attempting to balance others' opinions, whereas single agents use more self-interested reasoning.
  • Figure 5: Models show substantially different convergence patterns due to peer pressure, impacting results. In particular, Claude and Llama models demonstrate the highest tendency to converge.
  • ...and 9 more figures
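
The topologies in Figure 1 and the summary statistics in Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the agent labels, the equal-weight default, and the numeric scores are all hypothetical, and the paper's summary does not state how the linear combination of single agents is weighted. The sketch only shows who speaks to whom in a round-robin versus a star chat, and how a mean with SEM and a weighted single-agent baseline could be computed.

```python
from itertools import cycle, islice
from math import sqrt
from statistics import mean, stdev


def round_robin_order(agents, rounds):
    """Speaking order for a round-robin chat: each agent speaks once per
    round, and (by assumption) every agent reads the shared transcript."""
    return list(islice(cycle(agents), rounds * len(agents)))


def star_order(supervisor, agents, rounds):
    """Message pairs (sender, receiver) for a star chat: the supervisor
    addresses each agent in turn, and agents reply only to the supervisor."""
    messages = []
    for _ in range(rounds):
        for agent in agents:
            messages.append((supervisor, agent))  # supervisor prompts agent
            messages.append((agent, supervisor))  # agent answers supervisor
    return messages


def mean_and_sem(scores):
    """Sample mean and standard error of the mean (SEM = s / sqrt(n))."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


def linear_combination(single_agent_means, weights=None):
    """Weighted average of single-agent mean scores.

    Assumption: equal weights unless given; the source does not specify
    the weighting used for the black marker in Figure 2.
    """
    if weights is None:
        weights = {name: 1.0 for name in single_agent_means}
    total = sum(weights.values())
    return sum(single_agent_means[n] * w for n, w in weights.items()) / total


if __name__ == "__main__":
    agents = ["agent_a", "agent_b", "agent_c"]  # hypothetical labels
    print(round_robin_order(agents, rounds=2))
    print(star_order("supervisor", agents[:2], rounds=1))
    print(mean_and_sem([3, 4, 5]))  # hypothetical scores
    print(linear_combination({"agent_a": 2.0, "agent_b": 4.0}))
```

Comparing an ensemble's observed mean against such a baseline is one way to make the "not predictable from isolated agents" claim concrete: a gap between the two that exceeds the error bars would point to group dynamics rather than to the mix of models.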