Debiasing Pre-trained Contextualised Embeddings

Masahiro Kaneko, Danushka Bollegala

TL;DR

The paper tackles gender bias in contextualised word embeddings by introducing a fine-tuning debiasing method that is model-agnostic and can be applied at token or sentence level. It combines a bias-orthogonalisation loss $L_i$ with a regulariser $L_{reg}$ to remove protected attribute information while preserving pre-trained semantics, minimising the weighted sum $L = \alpha L_i + \beta L_{reg}$. The method is evaluated across different layers and at both token and sentence level. Empirical results on SEAT and MNLI show that token-level debiasing across all layers reduces bias most effectively while largely preserving downstream task performance, though some models (e.g., RoBERTa, ALBERT) are more sensitive to the trade-off between accuracy and bias. The work indicates that debiasing requires multi-layer intervention and provides a practical approach to fairer contextualised embeddings without full re-training.
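The weighted objective $L = \alpha L_i + \beta L_{reg}$ can be illustrated with a small sketch. The summary above does not spell out the two terms, so the forms used here are assumptions: $L_i$ is taken to be the sum of squared inner products between attribute (e.g. gender) vectors and the hidden states of target words, and $L_{reg}$ the squared distance between debiased and original hidden states. The function names and the plain-list vector representation are hypothetical and chosen only for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_loss(target_states: Sequence[Vector],
                       attribute_vectors: Sequence[Vector]) -> float:
    """Assumed form of L_i: squared inner products between every
    attribute vector and every target hidden state. It is zero when
    all target states are orthogonal to all attribute vectors."""
    return sum(dot(a, h) ** 2
               for h in target_states
               for a in attribute_vectors)


def regulariser(debiased_states: Sequence[Vector],
                original_states: Sequence[Vector]) -> float:
    """Assumed form of L_reg: squared Euclidean distance between the
    debiased hidden states and the original pre-trained ones."""
    return sum((d - o) ** 2
               for hd, ho in zip(debiased_states, original_states)
               for d, o in zip(hd, ho))


def combined_loss(target_states: Sequence[Vector],
                  attribute_vectors: Sequence[Vector],
                  debiased_states: Sequence[Vector],
                  original_states: Sequence[Vector],
                  alpha: float,
                  beta: float) -> float:
    """L = alpha * L_i + beta * L_reg, as stated in the summary."""
    return (alpha * orthogonality_loss(target_states, attribute_vectors)
            + beta * regulariser(debiased_states, original_states))
```

Under these assumptions, a larger $\alpha$ pushes target-word states further from the attribute directions, while a larger $\beta$ keeps all hidden states closer to the pre-trained model, which is one way to read the accuracy versus bias trade-off mentioned in the abstract.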

Abstract

In comparison to the numerous debiasing methods proposed for static non-contextualised word embeddings, the discriminative biases in contextualised embeddings have received relatively little attention. We propose a fine-tuning method that can be applied at token or sentence level to debias pre-trained contextualised embeddings. Our proposed method can be applied to any pre-trained contextualised embedding model, without requiring those models to be retrained. Using gender bias as an illustrative example, we then conduct a systematic study using several state-of-the-art (SoTA) contextualised representations on multiple benchmark datasets to evaluate the level of biases encoded in different contextualised embeddings before and after debiasing using the proposed method. We find that applying token-level debiasing for all tokens and across all layers of a contextualised embedding model produces the best performance. Interestingly, we observe that there is a trade-off between creating an accurate vs. unbiased contextualised embedding model, and different contextualised embedding models respond differently to this trade-off.

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Types of hidden states in $E$ considered in the proposed method. The blue boxes in the middle correspond to the hidden states of the target token.
  • Figure 2: Scatter plot of gender information of hidden states for original and debiased stereotype words.