Mitigating Biases in Language Models via Bias Unlearning

Dianqing Liu, Yi Liu, Guoqing Jin, Zhendong Mao

TL;DR

BiasUnlearn mitigates social biases in language models without sacrificing core capabilities. It uses a dual-pathway unlearning strategy that forgets stereotypes while retaining anti-stereotypes, and adds an adversarial forget set and dynamic dataset swapping to prevent bias reversal. Training uses the combined loss $\mathcal{L}=\alpha_{1}\mathcal{L}_{Forget}+\alpha_{2}\mathcal{L}_{Retention}+\alpha_{3}\mathcal{L}_{KL}$, with LoRA for efficiency. Across multiple base and instruction-tuned models and a broad suite of benchmarks (StereoSet, CrowS-Pairs, BBQ, GLUE, MT-Bench, FairMT, CEB), BiasUnlearn achieves significant bias reduction while preserving language modeling performance. The debiasing weights also transfer across model variants, suggesting that bias representations are entrenched during pre-training. The work provides practical, model-agnostic tools for fairer LLM deployment and offers insight into the persistence and transferability of bias representations.
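To make the weighted objective concrete, here is a minimal pure-Python sketch of how the three terms might be combined. Everything beyond the formula itself is an assumption made for illustration: the forgetting term is modeled as a negated negative log-likelihood on stereotype tokens (a common gradient-ascent form of unlearning), the retention term as an ordinary negative log-likelihood on anti-stereotype tokens, and the KL term as the divergence between a frozen reference model's next-token distribution and the current model's. The function names and the default weights are hypothetical, and the paper's exact definitions may differ.

```python
import math


def nll(token_probs):
    """Mean negative log-likelihood of target tokens, given their probabilities."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def kl_divergence(ref_dist, model_dist):
    """KL(ref || model) for two discrete distributions over the same vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(ref_dist, model_dist) if p > 0)


def combined_loss(stereo_probs, anti_probs, ref_dist, model_dist,
                  alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """L = alpha1 * L_Forget + alpha2 * L_Retention + alpha3 * L_KL.

    Assumed definitions (not taken from the source):
      L_Forget    = -NLL on stereotype tokens, so minimizing it lowers their likelihood
      L_Retention = +NLL on anti-stereotype tokens, so minimizing it keeps them likely
      L_KL        = KL(reference || current model), to limit drift from the original model
    """
    l_forget = -nll(stereo_probs)
    l_retention = nll(anti_probs)
    l_kl = kl_divergence(ref_dist, model_dist)
    total = alpha1 * l_forget + alpha2 * l_retention + alpha3 * l_kl
    return total, {"forget": l_forget, "retention": l_retention, "kl": l_kl}


if __name__ == "__main__":
    # Toy numbers: probabilities the model assigns to each target token.
    total, parts = combined_loss(
        stereo_probs=[0.5, 0.5],
        anti_probs=[0.25, 0.25],
        ref_dist=[0.5, 0.5],
        model_dist=[0.5, 0.5],
    )
    print(total, parts)
```

Under these assumptions, lowering the probability of stereotype tokens decreases the total, raising the probability of anti-stereotype tokens decreases it as well, and any drift from the reference distribution adds a non-negative penalty scaled by $\alpha_{3}$.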

Abstract

Many studies have shown that language models exhibit various biases targeting different demographic groups, amplifying discrimination and harming fairness. Recent debiasing approaches based on parameter modification significantly degrade core capabilities such as text coherence and task accuracy. Prompt-based debiasing methods are effective only for predefined trigger words and fail to address stereotypical associations deeply embedded in model parameters. In this paper, we propose BiasUnlearn, a novel model debiasing framework that achieves targeted debiasing via dual-pathway unlearning mechanisms that coordinate stereotype forgetting with anti-stereotype retention, while preventing bias polarity reversal through an adversarial forget set and dynamic dataset swapping. We conducted extensive experiments with multiple language models across various evaluation benchmarks. The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining language modeling capabilities. Further experiments reveal that debiasing weights are transferable across model variants, confirming that bias representations become entrenched during pre-training and persist through fine-tuning phases.

Paper Structure

This paper contains 22 sections, 5 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Demonstration of the BiasUnlearn framework.
  • Figure 2: Bias scores in each category of BBQ, split by whether the context is ambiguous or disambiguated. The higher the bias score, the stronger the bias.
  • Figure 3: Forgetting loss and Retention loss of BiasUnlearn with or without $\mathcal{L}_{Retention}$, $\mathcal{L}_{KL}$ or Adversarial Forget Set.
  • Figure 4: Comparison of BiasUnlearn with SFT and DPO.