Sustainable Modular Debiasing of Language Models

Anne Lauscher, Tobias Lüken, Goran Glavaš

TL;DR

Pretrained language models encode societal biases, and existing debiasing methods are costly and prone to forgetting. The authors introduce ADELE, a modular adapter-based debiasing framework that trains only lightweight adapters using counterfactually augmented data while keeping the original model frozen. ADELE achieves effective bias attenuation across multiple intrinsic and extrinsic benchmarks, supports zero-shot multilingual transfer to six languages, and can be augmented with task adapters (ADELE-TA) to mitigate fairness forgetting during downstream fine-tuning. The work offers a scalable, energy-efficient path toward fairer and more inclusive language technology with strong cross-lingual potential.

Abstract

Unfair stereotypical biases (e.g., gender, racial, or religious biases) encoded in modern pretrained language models (PLMs) have negative ethical implications for widespread adoption of state-of-the-art language technology. To remedy this, a wide range of debiasing techniques have recently been introduced to remove such stereotypical biases from PLMs. Existing debiasing methods, however, directly modify all of the PLM's parameters, which -- besides being computationally expensive -- comes with the inherent risk of (catastrophic) forgetting of useful language knowledge acquired in pretraining. In this work, we propose a more sustainable modular debiasing approach based on dedicated debiasing adapters, dubbed ADELE. Concretely, we (1) inject adapter modules into the original PLM layers and (2) update only the adapters (i.e., we keep the original PLM parameters frozen) via language modeling training on a counterfactually augmented corpus. We showcase ADELE in gender debiasing of BERT: our extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, shows ADELE to be very effective in bias mitigation. We further show that -- due to its modular nature -- ADELE, coupled with task adapters, retains fairness even after large-scale downstream training. Finally, by means of multilingual BERT, we successfully transfer ADELE to six target languages.
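
The counterfactually augmented corpus mentioned in the abstract can be illustrated with a minimal sketch. It assumes a tiny, hypothetical list of gendered term pairs and simple word-level swapping. The pair list, the function names, and the `two_sided` switch are illustrative assumptions, not the authors' implementation or term list.

```python
import re

# Hypothetical, deliberately tiny term-pair list (illustration only).
# Ambiguous pairs such as "his"/"her" are left out because they need
# part-of-speech information to swap correctly.
GENDER_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother"), ("boy", "girl")]

# Bidirectional lookup: each term maps to its counterpart.
SWAP = {}
for left, right in GENDER_PAIRS:
    SWAP[left] = right
    SWAP[right] = left

TOKEN_RE = re.compile(r"\b\w+\b")


def _match_case(source, target):
    """Copy the capitalization pattern of `source` onto `target`."""
    if source.isupper():
        return target.upper()
    if source[0].isupper():
        return target.capitalize()
    return target


def counterfactual(sentence):
    """Return the sentence with every listed gendered term swapped."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        return _match_case(word, swapped) if swapped else word
    return TOKEN_RE.sub(replace, sentence)


def augment(corpus, two_sided=True):
    """Build an augmented corpus from a list of sentences.

    two_sided=True keeps the original next to its counterfactual;
    two_sided=False keeps only the counterfactual. Sentences without
    any listed term are kept unchanged in both modes.
    """
    augmented = []
    for sentence in corpus:
        swapped = counterfactual(sentence)
        if swapped == sentence:
            augmented.append(sentence)
            continue
        if two_sided:
            augmented.append(sentence)
        augmented.append(swapped)
    return augmented
```

In an adapter-based setup such as the one the abstract describes, a corpus built this way would serve as the language modeling training data for the adapters while the PLM weights stay frozen.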

Paper Structure

This paper contains 37 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: WEAT bias effect heatmaps for (a) original BERT$_{Base}$, and the debiased BERTs, (b) BERT$_{\textsc{Adele}}$, (c) Zari$_{CDA}$ (Webster et al., 2020), and (d) Zari$_{DO}$, for word embeddings averaged over different subsets of layers $[m:n]$. E.g., $[0:0]$ points to word embeddings directly obtained from BERT's (sub)word embeddings (layer $0$); $[1:7]$ indicates word vectors obtained by averaging word representations after Transformer layers 1 through 7.
  • Figure 2: XWEAT effect size heatmaps for (a) original mBERT, and the debiased (b) mBERT$_{\textsc{Adele}}$ in seven languages (source language en, and transfer languages de, es, it, hr, ru, tr), for word embeddings averaged over different subsets of layers $[m:n]$. E.g., $[0:0]$ points to word embeddings directly obtained from BERT's (sub)word embeddings (layer $0$); $[1:7]$ indicates word vectors obtained by averaging word representations after Transformer layers 1 through 7. Lighter colors indicate less bias.
  • Figure 3: Bias and performance over time for different sizes of the downstream (MNLI) training set (#instances). We report the mean and the 95% confidence interval over five runs for Net Neutral (NN) on Bias-NLI and Accuracy (Acc) on the MNLI matched development set.
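
The layer-averaged word vectors and effect sizes behind Figures 1 and 2 can be sketched roughly as follows. This assumes the commonly used WEAT effect-size definition (difference of mean associations of two target sets X and Y with two attribute sets A and B, divided by the standard deviation over all targets) and uses the sample standard deviation, which is an assumption here. The function names are hypothetical and each vector is a plain list of floats; the captions do not specify the authors' exact implementation.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_layers(layer_vectors, m, n):
    """Average one word's per-layer vectors over layers m..n (inclusive).

    layer_vectors[0] is the (sub)word embedding layer, so [0:0] returns
    that layer unchanged and [1:7] averages Transformer layers 1 to 7.
    """
    selected = layer_vectors[m:n + 1]
    return [mean(dimension) for dimension in zip(*selected)]


def association(w, A, B):
    """s(w, A, B): mean similarity to A minus mean similarity to B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)


def weat_effect_size(X, Y, A, B):
    """Effect size: difference of mean associations over pooled std dev."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)
```

A heatmap cell for a layer range $[m:n]$ would then correspond to calling `average_layers` on every target and attribute word and passing the resulting vectors to `weat_effect_size`; values closer to zero indicate less measured bias.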