Gender Encoding Patterns in Pretrained Language Model Representations
Mahdi Zakizadeh, Mohammad Taher Pilehvar
TL;DR
The paper investigates how gender information and bias are encoded in encoder-based pretrained language models, and how debiasing and fine-tuning strategies change these internal representations. It applies Minimum Description Length (MDL) probing to the Bias in Bios dataset to quantify, layer by layer, how compressibly gender information is encoded in models such as BERT, RoBERTa, and JINA embeddings. A key finding is a robust two-phase pattern: early layers suppress gender signals, while final layers amplify them, which makes internal bias hard to remove, especially with post-hoc methods. The results indicate that training-time debiasing and fairness objectives integrated during pretraining are more effective than post-hoc fixes, and they underscore the need for architecture-aware mitigation strategies to limit bias propagation in downstream tasks such as retrieval and classification.
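To make the notion of "compression" concrete, the following is a minimal sketch of MDL probing, assuming the online-coding variant (the summary does not say which variant the authors use). Everything in it is illustrative rather than taken from the paper: the function names, the block fractions, the hand-written logistic-regression probe, and the synthetic two-dimensional "representations" that stand in for real layer activations. The labels are transmitted block by block. The first block is sent with a uniform code, and each later block is encoded with a probe trained on all earlier blocks. Compression is the uniform codelength divided by the online codelength, so higher values mean the property is easier to extract.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(xs, ys, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def block_bits(w, b, xs, ys, eps=1e-12):
    """Bits needed to encode the labels ys under the probe's predictions."""
    bits = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        p_true = p if y == 1 else 1.0 - p
        bits -= math.log2(max(p_true, eps))
    return bits


def online_codelength(xs, ys, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Online code for binary labels; the block fractions are an assumption."""
    n = len(xs)
    cuts = [max(1, int(f * n)) for f in fractions]
    total = cuts[0] * math.log2(2)  # first block: uniform code, 1 bit per label
    for start, end in zip(cuts[:-1], cuts[1:]):
        w, b = train_logreg(xs[:start], ys[:start])  # train on data seen so far
        total += block_bits(w, b, xs[start:end], ys[start:end])
    return total


def compression(xs, ys):
    """Uniform codelength divided by online codelength (binary labels)."""
    return len(xs) * math.log2(2) / online_codelength(xs, ys)


def make_toy(n, signal, seed=0):
    """Synthetic stand-in for layer representations (hypothetical data)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        mean = signal if y == 1 else -signal
        xs.append([rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0)])
        ys.append(y)
    return xs, ys


if __name__ == "__main__":
    for name, signal in [("strong signal", 2.0), ("no signal", 0.0)]:
        xs, ys = make_toy(200, signal)
        print(name, round(compression(xs, ys), 2))
```

In a layer-wise analysis of the kind described above, `xs` would instead hold the hidden states of one layer for each biography and `ys` the gender labels, and `compression` would be computed once per layer to trace how the signal changes with depth.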
Abstract
Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there has been no comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining how bias mitigation techniques and fine-tuning affect the encoded biases and how effective those techniques are, and exploring how differences in model design influence the encoding of biases. Our systematic investigation reveals a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the bias encoded in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing it in internal representations. This work provides guidance for advancing bias mitigation strategies and for developing more equitable language models.
