Perturbation Augmentation for Fairer NLP

Rebecca Qian, Candace Ross, Jude Fernandes, Eric Smith, Douwe Kiela, Adina Williams

TL;DR

This work investigates whether training on demographically perturbed data can reduce biases in NLP models. It introduces PANDA, a large human-annotated dataset of demographic perturbations, and a neural perturber trained on PANDA to generate fluent, controllable rewrites; this enables perturbation augmentation during pretraining (FairBERTa) and finetuning (fairtuning). Empirical results show improved fairness across multiple metrics and tasks with minimal or no loss in downstream performance, and the authors propose fairscore as an extrinsic measure of fairness. The study also discusses broader implications, potential pitfalls (e.g., fairwashing, factuality), and limitations related to demographic categories and data sourcing, outlining directions for future work in fairer NLP.

Abstract

Unwanted and often harmful social biases are becoming ever more salient in NLP research, affecting both models and datasets. In this work, we ask whether training on demographically perturbed data leads to fairer language models. We collect a large dataset of human-annotated text perturbations and train a neural perturbation model, which we show outperforms heuristic alternatives. We find that (i) language models (LMs) pre-trained on demographically perturbed corpora are typically more fair, (ii) LMs finetuned on perturbed GLUE datasets exhibit less demographic bias on downstream tasks, and (iii) fairness improvements do not come at the expense of performance on downstream tasks. Lastly, we discuss outstanding questions about how best to evaluate the (un)fairness of large language models. We hope that this exploration of neural demographic perturbation will help drive more improvement towards fairer NLP.

Paper Structure

This paper contains 56 sections, 2 equations, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our contributions. PANDA is our large-scale annotated dataset of demographic perturbations. Our perturber is trained on PANDA to generate high-quality perturbed text. We train an LM on data that has been augmented using the perturber. We also illustrate a method for finetuning on perturbation-augmented validation data, which we call fairtuning. Finally, we propose the fairscore, an extrinsic metric that quantifies fairness in LMs as robustness to demographic perturbation.
  • Figure 2: Breakdown of demographic axes and source data types in PANDA. 'Wiki' refers to Wikipedia and 'BC' refers to BookCorpus. The x-axis shows number of examples for each attribute. Analysis is shown for the rewritten examples.
  • Figure 3: Examples perturbed with heuristic approaches (AugLy and TextFlint), or the perturber (changed words highlighted); TextFlint did not perturb any words.
  • Figure 4: Design of Stage 1 of data collection, in which annotators select demographic terms in a text snippet.
  • Figure 5: Design of Stage 2 of data collection, in which annotators assign attributes to demographic words selected during Stage 1.
  • ...and 1 more figure