
Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization

Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li

TL;DR

This paper investigates identity bias in multi-agent debate (MAD), showing that sycophancy and self-bias distort belief updates and hinder reliable collective reasoning. It formalizes MAD as a Bayesian update with identity weights, introduces Conformity and Obstinacy as diagnostic metrics, and demonstrates a principled decomposition of bias into belief differences and identity effects. The authors propose Response Anonymization to remove identity cues, ensuring symmetric influence from self and peers, and define the Identity Bias Coefficient (IBC) to quantify residual bias. Empirical results across diverse models and tasks show that identity bias is widespread and that anonymization nearly eliminates it, with sycophancy dominating self-bias and with multi-peer settings further moderating bias. The work provides diagnostic tools and a lightweight mitigation strategy that preserves content-based reasoning and supports more reliable MAD deployments.

Abstract

Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer's view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish "self" from "peer", which forces equal weights on agent identity, thereby reducing bias. Third, we define the Identity Bias Coefficient (IBC), a principled metric that measures how often an agent follows a peer versus itself. Empirical studies across multiple models, datasets and debate rounds confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to "mask" identity to ensure that MAD systems reason based on content rather than source identity. Code is available at https://github.com/deeplearning-wisc/MAD-identity-bias.
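
To make the idea of response anonymization concrete, here is a minimal Python sketch of how a debate-round prompt might be assembled with and without identity markers. The function name `build_debate_prompt`, the prompt wording, and the shuffling step are all illustrative assumptions; they are not taken from the paper or its released code.

```python
import random


def build_debate_prompt(question, own_response, peer_responses,
                        anonymize=True, seed=None):
    """Assemble a prompt for one debate round.

    Hypothetical template for illustration only; the paper's actual
    prompt format may differ.
    """
    if anonymize:
        # Pool the agent's own response with its peers' responses and
        # shuffle (an assumption), so neither labels nor position
        # reveal which response is "self".
        responses = [own_response, *peer_responses]
        random.Random(seed).shuffle(responses)
        body = "\n".join(
            f"Response {k + 1}: {r}" for k, r in enumerate(responses)
        )
    else:
        # Identity-revealing variant: responses are tagged by source.
        lines = [f"Your previous response: {own_response}"]
        lines += [
            f"Agent {k + 1}'s response: {r}"
            for k, r in enumerate(peer_responses)
        ]
        body = "\n".join(lines)
    return f"Question: {question}\n{body}\nGive your updated answer."


anonymous = build_debate_prompt("What is 2 + 2?", "4", ["5"], seed=0)
identified = build_debate_prompt("What is 2 + 2?", "4", ["5"], anonymize=False)
```

In the anonymized variant every response carries the same neutral label, so any preference the agent shows has to come from the content of the responses rather than from who wrote them.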

Paper Structure

This paper contains 33 sections, 25 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Conformity vs. Obstinacy. Comparison is done on a 5-agent MAD system with a single peer assigned to each agent. The versions of the four models are Qwen2.5-7b-instruct, Llama3.1-8b-instruct, Mistral-7b-instruct-v0.3, Qwen2.5-32b-instruct, respectively.
  • Figure 2: Response Anonymization. By anonymizing the responses in multi-agent debate, an agent's answer is driven entirely by its belief state, rather than the agents' identity information.
  • Figure 3: IBC drops in multi-peer setups.
  • Figure 4: Identity Bias Coefficient across debate rounds.
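
The Conformity and Obstinacy quantities compared in Figure 1 can be sketched as simple frequencies over single-peer debate logs. This page does not reproduce the paper's exact definitions, so the reading below is an assumption: among rounds where an agent and its peer disagreed, Conformity is the fraction in which the agent switched to the peer's previous answer, and Obstinacy is the fraction in which it kept its own. The record format and the function name `conformity_obstinacy` are hypothetical, and the sketch does not attempt to compute the IBC itself.

```python
def conformity_obstinacy(records):
    """Estimate Conformity and Obstinacy from debate logs.

    `records` is an iterable of (own_prev, peer_prev, new_answer)
    tuples, one per agent per round (hypothetical format). Only rounds
    where the agent and its peer disagreed are counted, which is an
    assumed reading of the metrics, not the paper's stated formula.
    """
    disagreements = [(o, p, n) for o, p, n in records if o != p]
    if not disagreements:
        return 0.0, 0.0
    total = len(disagreements)
    # Fraction of disagreements where the agent adopted the peer's answer.
    conformity = sum(n == p for _, p, n in disagreements) / total
    # Fraction of disagreements where the agent kept its own answer.
    obstinacy = sum(n == o for o, _, n in disagreements) / total
    return conformity, obstinacy


logs = [
    ("A", "B", "B"),  # switched to the peer's answer
    ("A", "B", "A"),  # kept its own answer
    ("A", "A", "A"),  # agreement, ignored
    ("C", "D", "D"),  # switched to the peer's answer
    ("C", "D", "E"),  # moved to a third answer
]
conformity, obstinacy = conformity_obstinacy(logs)
```

The two fractions need not sum to one, because an agent can also move to an answer that neither it nor its peer held before.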