Mitigating Social Bias in Large Language Models: A Multi-Objective Approach within a Multi-Agent Framework
Zhenjie Xu, Wenqing Chen, Yi Tang, Xuanying Li, Cheng Hu, Zhixuan Chu, Kui Ren, Zibin Zheng, Zhichao Lu
TL;DR
This work tackles social bias in large language models (LLMs) by proposing MOMA, a multi-objective, multi-agent framework that intervenes at the input representation level to reduce bias without substantially harming downstream performance. Taking a causal-inference perspective, MOMA applies a transformation $X' = g_{\theta}(X, H)$ to weaken the influence of an unobserved bias driver $U$, and implements it as a two-stage pipeline of masking and balancing across a hierarchical agent system. Experiments on BBQ and StereoSet with GPT-3.5-Turbo and Llama-3-8B-Instruct show bias-score reductions of up to $87.7\%$ with accuracy degradation of at most $6.8\%$ on BBQ, and gains of up to $58.1\%$ in the multi-objective icat metric on StereoSet, outperforming several baselines and prompting strategies. The approach emphasizes transparent, controllable, and scalable debiasing through causal interventions that require only a small number of additional model calls, offering a practical path toward fairer LLM outputs in real-world applications.
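To make the shape of the two-stage transformation concrete, the following is a toy sketch only, not the authors' implementation. It composes a masking stage and a balancing stage into a single function playing the role of $g_{\theta}$. Everything inside the stages is an assumption made for illustration: the lexicon `GROUP_TERMS`, the rule-based `mask_stage`, and the appended note in `balance_stage` are hypothetical stand-ins for what would be LLM agent calls in the framework, and the second argument $H$ is omitted because this summary does not define it.

```python
import re
from typing import Callable, Sequence

# Hypothetical toy lexicon mapping social-group terms to neutral placeholders.
# In the framework this decision would be made by an LLM agent, not a lookup table.
GROUP_TERMS = {"grandfather": "person A", "grandson": "person B"}


def mask_stage(question: str) -> str:
    """Stage 1 (assumed behavior): hide bias-related terms behind placeholders."""
    for term, placeholder in GROUP_TERMS.items():
        pattern = rf"\b{re.escape(term)}\b"
        question = re.sub(pattern, placeholder, question, flags=re.IGNORECASE)
    return question


def balance_stage(question: str) -> str:
    """Stage 2 (placeholder assumption): add a neutral, symmetric note."""
    note = "Treat every person mentioned as equally capable."
    return f"{question} {note}"


def transform(question: str, stages: Sequence[Callable[[str], str]]) -> str:
    """Apply the stages in order; plays the role of X' = g(X) in this sketch."""
    for stage in stages:
        question = stage(question)
    return question


if __name__ == "__main__":
    x = "The grandfather and the grandson were using a phone. Who struggled?"
    print(transform(x, [mask_stage, balance_stage]))
```

Keeping each stage as a plain string-to-string function means a rule-based stand-in can later be swapped for a model call without changing `transform`.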
Abstract
Natural language processing (NLP) has seen remarkable advancements with the development of large language models (LLMs). Despite these advancements, LLMs often produce socially biased outputs. Recent studies have mainly addressed this problem by prompting LLMs to behave ethically, but this approach results in unacceptable performance degradation. In this paper, we propose a multi-objective approach within a multi-agent framework (MOMA) to mitigate social bias in LLMs without significantly compromising their performance. The key idea of MOMA is to deploy multiple agents that perform causal interventions on the bias-related content of input questions, breaking the shortcut connection between this content and the corresponding answers. Unlike traditional debiasing techniques, which degrade performance, MOMA substantially reduces bias while maintaining accuracy in downstream tasks. Our experiments on two datasets and two models demonstrate that MOMA reduces bias scores by up to 87.7%, with only a marginal performance degradation of up to 6.8% on the BBQ dataset. Additionally, it significantly improves the multi-objective metric icat on the StereoSet dataset by up to 58.1%. Code will be made available at https://github.com/Cortantse/MOMA.
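For readers unfamiliar with icat, the sketch below computes it as defined by the StereoSet benchmark: it combines a language modeling score (lms) and a stereotype score (ss), both on a 0 to 100 scale, as $\text{icat} = \text{lms} \cdot \min(\text{ss}, 100 - \text{ss}) / 50$. The function name and the input values are illustrative and are not results from this paper.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score from StereoSet.

    lms: language modeling score in [0, 100] (higher is better).
    ss:  stereotype score in [0, 100] (50 means no preference between
         stereotypical and anti-stereotypical options).
    """
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must both lie in [0, 100]")
    # The score is maximal (equal to lms) at ss = 50 and falls to 0
    # as ss approaches 0 or 100.
    return lms * min(ss, 100.0 - ss) / 50.0


if __name__ == "__main__":
    # Illustrative inputs only: a fluent but biased model vs. a slightly
    # less fluent but more balanced one.
    print(icat(lms=90.0, ss=60.0))
    print(icat(lms=85.0, ss=52.0))
```

Because the score rewards both fluency and balance, a debiasing method can raise icat even when it costs a little language modeling quality, which is why it serves as a multi-objective metric here.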
