fairBERTs: Erasing Sensitive Information Through Semantic and Fairness-aware Perturbations
Jinfeng Li, Yuefeng Chen, Xiangyu Liu, Longtao Huang, Rong Zhang, Hui Xue
TL;DR
This paper tackles biases tied to protected sensitive attributes in pre-trained language models by introducing fairBERTs, a GAN-based framework that perturbs BERT-style hidden representations to erase sensitive information while preserving task performance. A generator G takes the semantic-rich sequence representation h_s and produces a semantic and fairness-aware perturbation, giving the perturbed representation h_c^F = h_c + G(h_s). Adversarial discriminators are trained alongside it so that the protected attribute z becomes hard to predict from the representation without sacrificing accuracy. Empirical results on toxicity detection and sentiment analysis show improved fairness across metrics with minimal utility loss, and the perturbations transfer to vanilla BERT-like models, suggesting practical applicability. The work advances fair fine-tuning of PLMs and opens avenues for deploying fairer models across diverse NLP tasks without substantial retraining costs.
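To make the perturbation step h_c^F = h_c + G(h_s) concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the generator is a hypothetical single random linear map, the vectors are toy values, and the names make_linear_generator and perturb are illustrative. We assume h_c is the representation fed to the classifier, which the summary does not define. The adversarial training against discriminators is omitted entirely.

```python
import random


def make_linear_generator(dim, scale=0.01, seed=0):
    """Hypothetical stand-in for G: one random linear map applied to h_s."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def generator(h_s):
        # delta[i] = sum_j weights[i][j] * h_s[j]
        return [sum(w * x for w, x in zip(row, h_s)) for row in weights]

    return generator


def perturb(h_c, h_s, generator):
    """Return h_c^F = h_c + G(h_s), computed elementwise."""
    delta = generator(h_s)
    if len(delta) != len(h_c):
        raise ValueError("perturbation and h_c must have the same dimension")
    return [c + d for c, d in zip(h_c, delta)]


dim = 4
G = make_linear_generator(dim)
h_s = [0.5, -1.0, 0.25, 2.0]  # toy sequence representation (assumed values)
h_c = [1.0, 0.0, -0.5, 0.3]   # toy classification representation (assumed values)
h_c_fair = perturb(h_c, h_s, G)
```

In a trained system G would be learned so that a discriminator cannot recover z from h_c_fair. Here it only illustrates that the perturbation is additive and depends on h_s rather than on h_c.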
Abstract
Pre-trained language models (PLMs) have revolutionized both natural language processing research and applications. However, stereotypical biases (e.g., gender and racial discrimination) encoded in PLMs have raised ethical concerns that critically limit their broader application. To address these unfairness issues, we present fairBERTs, a general framework for learning fair fine-tuned BERT series models by erasing protected sensitive information via semantic and fairness-aware perturbations generated by a generative adversarial network. Through extensive qualitative and quantitative experiments on two real-world tasks, we demonstrate the superiority of fairBERTs in mitigating unfairness while maintaining model utility. We also verify the feasibility of transferring the adversarial components of fairBERTs to other conventionally trained BERT-like models to yield fairness improvements. Our findings may shed light on further research on building fairer fine-tuned PLMs.
