Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation

Huimin Lu; Masaru Isonuma; Junichiro Mori; Ichiro Sakata

TL;DR

The paper tackles bias propagation in large language models by proposing Masked Language Modeling (MLM) Unlearning, an unlearning-based debiasing method that performs gradient ascent on biased content, lowering its likelihood while retaining language modeling performance. The authors formalize the objective as the loss $\mathcal{L}_{\text{MLM Unlearning}} = \log P(x_i \mid X \setminus x_i)$, the log-likelihood of a token $x_i$ given the rest of its context $X \setminus x_i$; minimizing it dissociates biased tokens from their contexts. They demonstrate the approach on a GPT-2 backbone using a dynamically generated hate speech dataset that targets women. They show that debiasing in one domain (e.g., gender) can transfer to other domains (e.g., race, religion), which indicates cross-domain transfer unlearning and suggests potential for more universal debiasing. Evaluation with Wikitext-2 perplexity and two bias benchmarks (CrowS-Pairs, StereoSet) indicates that MLM Unlearning preserves language modeling quality while reducing bias.

Abstract

Large language models (LLMs) often inherit biases from their vast training corpora. Traditional debiasing methods, while effective to some extent, do not completely eliminate memorized biases and toxicity in LLMs. In this paper, we study an unlearning-based approach to debiasing in LLMs by performing gradient ascent on hate speech against minority groups, i.e., minimizing the likelihood of biased or toxic content. Specifically, we propose a masked language modeling unlearning technique, which unlearns the harmful part of the text. This method enables LLMs to selectively forget and disassociate from biased and harmful content. Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining language modeling abilities. Surprisingly, the results also reveal a potential for cross-domain transfer unlearning: debiasing in one bias form (e.g., gender) may contribute to mitigating others (e.g., race and religion).
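
The gradient-ascent idea can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single masked position, a hypothetical three-token vocabulary, and raw logits standing in for model parameters. The names `unlearning_loss` and `unlearning_step`, the logit values, and the learning rate are all illustrative assumptions. It shows that taking descent steps on $\log P(x_i \mid X \setminus x_i)$, which is the same as gradient ascent on the usual negative log-likelihood, lowers the probability of the targeted token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unlearning_loss(logits, target):
    """log P(target | context); minimizing it lowers the target's likelihood."""
    return math.log(softmax(logits)[target])

def unlearning_step(logits, target, lr=0.5):
    """One descent step on log P(target), i.e. gradient ascent on the NLL."""
    probs = softmax(logits)
    # Gradient of log p_target w.r.t. logit k is 1[k == target] - p_k.
    grad = [(1.0 if k == target else 0.0) - p for k, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grad)]

# Hypothetical scores at one masked position (assumption, not from the paper).
logits = [2.0, 0.5, 0.1]
target = 0  # index of the token to be unlearned

before = softmax(logits)[target]
loss_before = unlearning_loss(logits, target)
for _ in range(10):
    logits = unlearning_step(logits, target)
after = softmax(logits)[target]
loss_after = unlearning_loss(logits, target)

print(f"P(target) before: {before:.3f}, after: {after:.3f}")
```

In a real model the update would apply to the network weights rather than to the logits directly, and only the masked (harmful) positions would contribute to the loss; the sketch keeps just the sign convention and its effect on the token probability.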

Paper Structure (27 sections, 1 equation, 1 figure, 2 tables)

Introduction
Related Work
Counterfactual Data Augmentation (CDA)
SentenceDebias
Iterative Nullspace Projection (INLP)
Self-Debias
Bias and Toxicity Unlearning
Masked Language Unlearning
Experiment
Experimental Setup
Unlearning Dataset
Implementation Details
Evaluation Setup
Wikitext-2
CrowS-Pairs
...and 12 more sections

Figures (1)

Figure 1: Perplexity Results across Different Unlearning Steps
