fairBERTs: Erasing Sensitive Information Through Semantic and Fairness-aware Perturbations

Jinfeng Li; Yuefeng Chen; Xiangyu Liu; Longtao Huang; Rong Zhang; Hui Xue

TL;DR

This paper tackles protected-attribute biases in pre-trained language models by introducing fairBERTs, a GAN-based framework that perturbs BERT-style hidden representations to erase sensitive information while preserving task performance. A generator G operates on the semantic-rich sequence representation h_s to produce semantic and fairness-aware perturbations, giving h_c^F = h_c + G(h_s); adversarial discriminators are trained alongside it so that the sensitive attribute z becomes hard to predict without sacrificing accuracy. Empirical results on toxicity detection and sentiment analysis show improved fairness across metrics with minimal utility loss. The perturbations also transfer to vanilla BERT-like models, which suggests practical applicability. The work advances fair fine-tuning of PLMs and opens avenues for deploying fairer models across diverse NLP tasks without substantial retraining costs.
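The perturbation step described above, h_c^F = h_c + G(h_s), can be illustrated with a minimal sketch. Only the additive form comes from the summary; everything else is an assumption made for illustration: h_c is treated as a single vector, h_s as a list of token vectors that is mean-pooled, and the generator (`make_generator`) is a hypothetical one-layer tanh network with random, untrained weights. The adversarial training against the discriminators is not shown.

```python
import math
import random


def make_generator(dim, seed=0):
    """Hypothetical generator G: one linear layer followed by tanh (an assumption)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def generator(h_s):
        # h_s: list of token vectors (seq_len x dim).
        # Mean-pooling into one vector is an assumption of this sketch.
        seq_len = len(h_s)
        pooled = [sum(tok[j] for tok in h_s) / seq_len for j in range(dim)]
        # Linear map followed by tanh, so each perturbation entry lies in (-1, 1).
        return [math.tanh(sum(w * x for w, x in zip(row, pooled))) for row in weights]

    return generator


def perturb(h_c, h_s, generator):
    """Return h_c^F = h_c + G(h_s), the additive form stated in the summary."""
    delta = generator(h_s)
    return [c + d for c, d in zip(h_c, delta)]


dim = 4
G = make_generator(dim)
h_s = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5]]  # toy token representations
h_c = [0.5, -0.5, 0.25, 0.0]                          # toy representation to be perturbed
h_c_fair = perturb(h_c, h_s, G)
```

In a trained system the weights of G would be learned so that a discriminator cannot recover z from `h_c_fair`; here they are random, so the output only demonstrates the shapes and the additive structure.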

Abstract

Pre-trained language models (PLMs) have revolutionized both natural language processing research and applications. However, stereotypical biases (e.g., gender and racial discrimination) encoded in PLMs raise ethical concerns that critically limit their broader application. To address these unfairness issues, we present fairBERTs, a general framework for learning fair fine-tuned BERT series models by erasing the protected sensitive information via semantic and fairness-aware perturbations generated by a generative adversarial network. Through extensive qualitative and quantitative experiments on two real-world tasks, we demonstrate the superiority of fairBERTs in mitigating unfairness while maintaining model utility. We also verify the feasibility of transferring the adversarial components of fairBERTs to other conventionally trained BERT-like models to yield fairness improvements. Our findings may shed light on further research on building fairer fine-tuned PLMs.

Paper Structure (12 sections, 8 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Methodology
Problem Definition
Method
Learning of fairBERTs
Experiments
Experimental Setup
Qualitative Evaluation of Fairness
Quantitative Evaluation of Fairness
Evaluation of Transferability
Conclusion

Figures (5)

Figure 1: Illustration of unfairness involved in a BERT model deployed on HuggingFace. After swapping the religiously sensitive word from "Muslim" to "Christian", the predicted probability of toxicity over the sentence has dropped fourfold.
Figure 2: The framework of fairBERTs.
Figure 3: Visualization of interpretations given by LIME on vanilla BERT and fairBERTs over two cases.
Figure 4: Comparison of intermediate model representations. fairBERTs* denotes the latent representation before adding semantic and fairness-aware perturbations.
Figure 5: Comparison of sensitive words in the top-3 important decision words given by LIME.
