Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Yongxin Deng, Xihe Qiu, Xiaoyu Tan, Jing Pan, Chen Jue, Zhijun Fang, Yinghui Xu, Wei Chu, Yuan Qi

TL;DR

The paper addresses implicit bias in large language models by formalizing the problem and proposing BTBR, a Bayesian-theory based bias removal framework. BTBR identifies bias-prone data using likelihood-ratio screening, estimates biased samples through a fine-tuned surrogate, and removes the bias by editing canonical triples with MEMIT in a black-box fashion, avoiding full retraining. Empirical results on Llama3-8B-Instruct show that BTBR reduces implicit bias across diverse tasks while preserving performance, and ablations underscore that precise sample selection matters more than brute-force data removal. The approach enables targeted, data-driven bias elimination that leverages public datasets and model-editing techniques, and offers practical guidance for deploying fairer LLMs in real-world settings.

Abstract

Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model's weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs relevant knowledge triples and expunges bias information from LLMs using model editing techniques. Through extensive experimentation, we have confirmed the presence of the implicit bias problem in LLMs and demonstrated the effectiveness of our BTBR approach.
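The likelihood-ratio screening step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of a reference model, the threshold of 0, the field names `target_logprobs` and `reference_logprobs`, and the toy numbers are all assumptions made for illustration. The idea shown is only that an entry is flagged when one model assigns it a noticeably higher likelihood than another.

```python
from typing import Dict, List, Sequence


def log_likelihood_ratio(target_logprobs: Sequence[float],
                         reference_logprobs: Sequence[float]) -> float:
    """Return log(P_target(text) / P_reference(text)) from per-token log-probs."""
    return sum(target_logprobs) - sum(reference_logprobs)


def screen_entries(entries: List[Dict], threshold: float = 0.0) -> List[Dict]:
    """Keep entries whose log-likelihood ratio exceeds the threshold.

    The threshold rule is an assumption for this sketch, not the paper's criterion.
    """
    kept = []
    for entry in entries:
        llr = log_likelihood_ratio(entry["target_logprobs"],
                                   entry["reference_logprobs"])
        if llr > threshold:
            kept.append({"text": entry["text"], "llr": llr})
    # Highest ratio first, so the most strongly flagged entries lead the list.
    return sorted(kept, key=lambda e: e["llr"], reverse=True)


# Toy, made-up per-token log-probabilities (hypothetical values).
toy_entries = [
    {"text": "entry A",
     "target_logprobs": [-1.0, -0.5],
     "reference_logprobs": [-2.0, -1.0]},
    {"text": "entry B",
     "target_logprobs": [-3.0],
     "reference_logprobs": [-1.0]},
]

screened = screen_entries(toy_entries)
```

In practice the per-token log-probabilities would come from scoring each dataset entry with the models being compared; how those models are chosen is left to the paper's method.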

Paper Structure (16 sections, 8 equations, 3 figures, 2 tables)

This paper contains 16 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Diagram of Implicit Bias in LLMs. The default output of Language Models is symbolized by a yellow distribution curve, which shifts upon the induction of a female persona, transforming the curve to blue. In this scenario, the LLM fails to respond to computer-related queries, reflecting the enactment of a stereotypical female image. Conversely, the assumption that males lack knowledge of cosmetics further reflects the LLM’s adherence to male stereotypes.
  • Figure 2: Diagram of Bias Induction Techniques. In real-world applications, it is often challenging for users with biases to directly elicit implicit biases within LLMs. Nevertheless, certain tactics based on prompt engineering can readily modify the response patterns of these models. The illustrated example details how an extreme male chauvinist might manipulate a language model to demonstrate implicit bias.
  • Figure 3: Visualization of DB Values. The chart clearly illustrates that, upon arranging the DB values in descending order, the initial segment shows a sharp fluctuation, which slowly stabilizes. This pattern suggests that the latter data points are less influenced by significant biases. The demarcation is approximately at an index of 34. To mitigate the risk of removing too much data, we have opted for $K=30$.
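The selection described in the Figure 3 caption (sort the DB values in descending order, then keep the first $K$ entries) can be sketched as follows. This is a hedged illustration only: the function names and the toy values are hypothetical, and the caption's $K=30$ is used merely as the default.

```python
from typing import List, Sequence


def rank_by_db(db_values: Sequence[float]) -> List[int]:
    """Return entry indices ordered by DB value, largest first."""
    return sorted(range(len(db_values)), key=lambda i: db_values[i], reverse=True)


def top_k_indices(db_values: Sequence[float], k: int = 30) -> List[int]:
    """Indices of the k entries with the largest DB values (K=30 in the caption)."""
    return rank_by_db(db_values)[:k]


def successive_drops(db_values: Sequence[float]) -> List[float]:
    """Differences between neighbouring values of the descending-sorted curve.

    Large early drops followed by small ones correspond to the curve
    flattening out, which is the pattern the caption describes.
    """
    ordered = sorted(db_values, reverse=True)
    return [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]


# Toy DB values (hypothetical), far fewer than in the real chart.
toy_db = [0.1, 0.9, 0.5, 0.7]
selected = top_k_indices(toy_db, k=2)
drops = successive_drops(toy_db)
```

Choosing $K$ slightly below the point where the curve flattens, as the caption does with 30 versus roughly 34, trades a little coverage for a lower risk of removing data that carries no significant bias.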