Exploring and Mitigating Gender Bias in Encoder-Based Transformer Models
Ariyan Hossain, Khondokar Mohammad Ahanaf Hannan, Rakinul Haque, Nowreen Tarannum Rafa, Humayra Musarrat, Shoaib Ahmed Dipu, Farig Yousuf Sadeque
TL;DR
This work examines gender bias in encoder-based transformer models, focusing on contextualized word embeddings produced by architectures like BERT, RoBERTa, ALBERT, and DistilBERT. It introduces MALoR, a model-agnostic metric based on the mean absolute log-ratio of MLM token probabilities across gendered terms and professions, and validates it across three experiments: he–she, his–her, and male–female names. To mitigate bias, the authors use Counterfactual Data Augmentation to create gender-balanced corpora and continue pretraining the models on them, reporting substantial reductions in MALoR scores with no significant degradation in SST-2 performance. The study highlights model size and vocabulary as factors in debiasing effectiveness and provides a reproducible methodology, including detailed sentence templates and datasets, for evaluating and reducing gender bias in contextualized embeddings. This approach offers a practical, data-efficient pathway to fairer transformer-based systems in downstream NLP tasks.
Abstract
Gender bias in language models has gained increasing attention in the field of natural language processing. Encoder-based transformer models, which have achieved state-of-the-art performance in various language tasks, have been shown to exhibit strong gender biases inherited from their training data. This paper investigates gender bias in contextualized word embeddings, a crucial component of transformer-based models. We focus on prominent architectures such as BERT, ALBERT, RoBERTa, and DistilBERT to examine their vulnerability to gender bias. To quantify the degree of bias, we introduce a novel metric, MALoR, which assesses bias based on model probabilities for filling masked tokens. We further propose a mitigation approach involving continued pre-training on a gender-balanced dataset generated via Counterfactual Data Augmentation. Our experiments reveal significant reductions in gender bias scores across gendered pronoun and name pairs. For instance, in BERT-base, the bias score for "he-she" drops from 1.27 to 0.08, and for "his-her" from 2.51 to 0.36, following our mitigation approach. We observe similar improvements across other models, with "male-female" bias decreasing from 1.82 to 0.10 in BERT-large. Our approach effectively reduces gender bias without compromising model performance on downstream tasks.
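As a rough illustration of the two ideas named above, the sketch below shows a word-swap form of Counterfactual Data Augmentation and a mean absolute log-ratio score. It is not the authors' implementation. The swap list `SWAPS`, the helper names `counterfactual`, `augment`, and `malor`, and the use of the natural logarithm are all assumptions made for this example. The probability pairs passed to `malor` stand in for the masked-token probabilities a model would assign to the male and female term in the same template sentence.

```python
import math
import re

# Hypothetical swap list for illustration only; the paper's actual word pairs
# are not specified in this section. Note that "her" is ambiguous in English
# (his/him); this sketch always maps it to "his".
SWAPS = {
    "he": "she", "she": "he",
    "his": "her", "her": "his",
    "man": "woman", "woman": "man",
}


def counterfactual(sentence: str) -> str:
    """Swap each gendered word for its counterpart, keeping capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key not in SWAPS:
            return word  # leave non-gendered words untouched
        repl = SWAPS[key]
        return repl.capitalize() if word[0].isupper() else repl

    return re.sub(r"[A-Za-z]+", swap, sentence)


def augment(corpus: list[str]) -> list[str]:
    """Return the corpus plus the counterfactual of every sentence that changes."""
    augmented = list(corpus)
    for sentence in corpus:
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


def malor(prob_pairs: list[tuple[float, float]]) -> float:
    """Mean absolute log-ratio over (p_male, p_female) probability pairs.

    Assumes natural log and strictly positive probabilities. A score of 0
    means the two terms are equally likely in every template.
    """
    ratios = [abs(math.log(p_male / p_female)) for p_male, p_female in prob_pairs]
    return sum(ratios) / len(ratios)
```

Taking the absolute value before averaging means that a bias toward the male term in one profession cannot cancel a bias toward the female term in another, so a score near zero indicates balance template by template rather than only on average.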
