Mitigating Social Bias in Large Language Models: A Multi-Objective Approach within a Multi-Agent Framework

Zhenjie Xu, Wenqing Chen, Yi Tang, Xuanying Li, Cheng Hu, Zhixuan Chu, Kui Ren, Zibin Zheng, Zhichao Lu

TL;DR

This work tackles social bias in large language models by proposing MOMA, a multi-objective, multi-agent framework that intervenes on the input question itself to reduce bias without substantially harming downstream performance. Adopting a causal inference perspective, MOMA applies a transformation $X' = g_{\theta}(X, H)$ to weaken the influence of an unobserved bias driver $U$, and implements it as a two-stage pipeline of masking and balancing across a hierarchical agent system. Empirical results on BBQ and StereoSet with GPT-3.5-Turbo and Llama-3-8B-Instruct show substantial bias reductions (up to $87.7\%$) with modest accuracy degradation on BBQ (up to $6.8\%$) and notable gains in the multi-objective metric icat on StereoSet (improved by up to $58.1\%$), outperforming several baselines and prompting strategies. The approach emphasizes transparency, controllability, and scalability of debiasing through causal interventions and a small number of additional model calls, offering a practical path toward fairer LLM outputs in real-world applications.
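The masking and balancing stages can be pictured as a chain of model calls that rewrites the question before it is answered. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` type, the function `debias_then_answer`, both instruction strings, and the stub agents are all hypothetical names invented here, and the balancing instruction is only a guess at the spirit of that stage. The composition of masker and balancer stands in for the transformation $g_{\theta}$.

```python
from typing import Callable

# Hypothetical agent type: takes a prompt string, returns the model's text.
Agent = Callable[[str], str]

# Placeholder instructions invented for this sketch (not the paper's prompts).
MASK_INSTRUCTION = "Replace social-group attributes with neutral placeholders."
BALANCE_INSTRUCTION = "Describe every person mentioned in equally positive terms."


def debias_then_answer(question: str, masker: Agent, balancer: Agent,
                       answerer: Agent) -> str:
    """Mask, then balance, then answer; each stage is one extra model call."""
    masked = masker(f"{MASK_INSTRUCTION}\n{question}")
    balanced = balancer(f"{BALANCE_INSTRUCTION}\n{masked}")
    return answerer(balanced)


# Deterministic stand-ins so the sketch runs without access to a model.
def stub_masker(prompt: str) -> str:
    text = prompt.split("\n", 1)[1]  # drop the instruction line
    return text.replace("cashier", "[PERSON A]").replace("lawyer", "[PERSON B]")


def stub_balancer(prompt: str) -> str:
    text = prompt.split("\n", 1)[1]  # drop the instruction line
    return text + " Both people are equally capable."


def stub_answerer(prompt: str) -> str:
    # Mimics a shortcut: gives the stereotyped answer whenever the cue is visible.
    return "The lawyer" if "lawyer" in prompt else "Unknown"


question = "A cashier and a lawyer were talking. Who did very well in high school?"
print(stub_answerer(question))
print(debias_then_answer(question, stub_masker, stub_balancer, stub_answerer))
```

With a real model, each stub would be replaced by a call to the chosen LLM; the point of the sketch is only that the answering agent never sees the occupation cue that triggered the shortcut.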

Abstract

Natural language processing (NLP) has seen remarkable advancements with the development of large language models (LLMs). Despite these advancements, LLMs often produce socially biased outputs. Recent studies have mainly addressed this problem by prompting LLMs to behave ethically, but this approach results in unacceptable performance degradation. In this paper, we propose a multi-objective approach within a multi-agent framework (MOMA) to mitigate social bias in LLMs without significantly compromising their performance. The key idea of MOMA is to deploy multiple agents that perform causal interventions on the bias-related content of input questions, breaking the shortcut connection between this content and the corresponding answers. Unlike traditional debiasing techniques, which degrade performance, MOMA substantially reduces bias while maintaining accuracy in downstream tasks. Our experiments on two datasets and two models demonstrate that MOMA reduces bias scores by up to 87.7%, with only a marginal performance degradation of up to 6.8% on the BBQ dataset. Additionally, it significantly improves the multi-objective metric icat on the StereoSet dataset by up to 58.1%. Code will be made available at https://github.com/Cortantse/MOMA.
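The abstract reports icat without defining it. As defined by the StereoSet benchmark, the Idealized Context Association Test score combines a language modeling score (lms, 0 to 100) with a stereotype score (ss, 0 to 100, where 50 is unbiased) as $\text{icat} = \text{lms} \cdot \min(\text{ss}, 100 - \text{ss}) / 50$. The short sketch below computes it; the function name and the example inputs are illustrative only and are not results from the paper, and it is an assumption that the paper evaluates icat exactly in this form.

```python
def icat(lms: float, ss: float) -> float:
    """StereoSet icat score from a language modeling score and a stereotype score.

    Both inputs are percentages in [0, 100]. The score is 100 for a model that
    always prefers meaningful text (lms = 100) and is unbiased (ss = 50), and 0
    for a model that always picks the stereotype or always the anti-stereotype.
    """
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must be percentages in [0, 100]")
    return lms * min(ss, 100.0 - ss) / 50.0


# Illustrative inputs only (not values reported in the paper).
print(icat(100.0, 50.0))  # ideal model
print(icat(90.0, 60.0))   # fluent but somewhat stereotyped
print(icat(80.0, 100.0))  # always chooses the stereotype
```

Because the score multiplies fluency by closeness to ss = 50, a debiasing method raises icat only if it moves ss toward 50 without giving up much lms, which is why it serves as a multi-objective metric.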

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: A scenario from the BBQ dataset: "A cashier and a lawyer were talking. Who did very well in high school?" The upper LLM responds with bias; the middle LLM, prompted with human values, shows reversed bias or hallucinations, known as the "Alignment Tax"; MOMA uses multiple agents to give a fair and unbiased response while controlling the influence of human values.
  • Figure 2: A causal inference perspective on bias.
  • Figure 3: The MOMA Pipeline. MOMA consists of three stages: Masking, Balancing, and Answering. The bar charts illustrate how social group disparities, such as between a lawyer (red) and a cashier (green), are reduced after applying MOMA.
  • Figure 4: Hierarchical MOMA
  • Figure 5: Pareto frontier on the BBQ dataset, comparing GPT-3.5 (left) and Llama-3 (right) for accuracy and bias trade-offs.
  • ...and 3 more figures