Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models

Shweta Parihar; Liu Guangliang; Natalie Parde; Lu Cheng

TL;DR

This paper tackles gender bias in language models by addressing a key weakness of standard counterfactual data augmentation (CDA): language modeling degrades because the augmented data drifts from the pretraining distribution and the counterfactuals ignore context. It introduces Context-CDA, which uses large LMs to generate context-rich counterfactuals and applies semantic-entropy-based filtering to remove high-uncertainty examples before fine-tuning small target LMs. Across five diverse architectures, Context-CDA reduces intrinsic bias (StereoSet, CrowS-Pairs) while preserving or improving language modeling performance (LMS, ICAT), and it does so without degrading extrinsic bias scores or downstream task performance. The method is model-agnostic, converges consistently at around epochs 75–85, and offers insights into gender bias through next-token distribution analysis. The authors see potential for broader, multilingual, and domain-specific extensions.

Abstract

A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue: it can generate synthetic data that aligns poorly with real-world distributions, or create overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance, while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.
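The two mechanical steps named in the abstract, flipping gendered words and filtering by uncertainty, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the word list `GENDER_PAIRS`, the function names, the cluster-based entropy computation, and the keep-if-below-threshold rule are all assumptions made for the example. The large-LM context augmentation step is omitted because the prompts are not given here.

```python
import math
import re
from collections import defaultdict

# Hypothetical, deliberately tiny swap list (an assumption for illustration);
# real CDA word lists are much larger and must handle ambiguous words like "her".
GENDER_PAIRS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}

def flip_gender_words(text):
    """Swap each listed gender word with its counterpart, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        repl = GENDER_PAIRS.get(word.lower())
        if repl is None:
            return word  # not a listed gender word: leave unchanged
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[A-Za-z]+", swap, text)

def semantic_entropy(cluster_ids, probs):
    """Entropy over meaning clusters.

    cluster_ids[i] labels the meaning cluster of sample i and probs[i] is that
    sample's probability under the target LM (both assumed to be given).
    Probability mass is summed per cluster, normalized, and passed to -sum p log p.
    """
    mass = defaultdict(float)
    for cid, p in zip(cluster_ids, probs):
        mass[cid] += p
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values() if m > 0)

def filter_by_entropy(examples, entropies, threshold):
    """Keep only examples whose entropy does not exceed the (assumed) threshold."""
    return [ex for ex, h in zip(examples, entropies) if h <= threshold]

if __name__ == "__main__":
    print(flip_gender_words("He is a father."))
    print(semantic_entropy([0, 1], [0.5, 0.5]))
    print(filter_by_entropy(["a", "b"], [0.1, 0.9], threshold=0.5))
```

A single-cluster sample set yields zero entropy (the target LM is certain about the meaning), while mass spread evenly over many clusters yields high entropy, which is the kind of example the filtering step is described as excluding.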

Paper Structure (17 sections, 1 equation, 42 figures, 1 table)

Introduction
Related Work
Methods
Context-Aware CDA
Uncertainty-Based Filtering
Debiasing via Fine-tuning on Filtered Context-CDA
Experiments
Experimental Setup
Post Debiasing Performance on Intrinsic Bias Scores and Language Modeling
Multi-Model Evaluation: Demonstrating Robustness and Generalizability
Convergence Patterns
Comparison with Prior Debiasing Methods
Post Debiasing Performance on Extrinsic Bias and Downstream Tasks
Next-Token Distribution
Limitations
...and 2 more sections

Figures (42)

Figure 1: An illustration of the proposed Context-CDA pipeline using the StereoSet benchmark (Nadeem et al., 2020) with BERT (Devlin et al., 2019). Step 1: Flip the gender words. Step 2: Use a larger LM (e.g., Llama-3-8B-Instruct; Grattafiori et al., 2024) with the system and instruction prompt to get the augmented data. Step 3: Use the target small LM (e.g., BERT) to calculate the semantic entropy of augmented data. Step 4: Filter the counterfactuals based on the semantic entropy. Step 5: Debias the target small LM.
Figure 2: StereoSet bias score for BERT (Intrinsic bias). 50 indicates no bias.
Figure 3: CrowS-Pairs bias score for BERT (Intrinsic bias). 50 indicates no bias.
Figure 4: StereoSet bias score for DistilBERT.
Figure 5: CrowS-Pairs bias score for DistilBERT.
...and 37 more figures
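To make the "50 indicates no bias" convention in the captions concrete, the sketch below computes a StereoSet-style stereotype score (SS) and the ICAT score, using the definition from the StereoSet benchmark, ICAT = LMS × min(SS, 100 − SS) / 50. The function names and the boolean input format are assumptions for illustration, and this page does not confirm that the paper's evaluation code follows exactly this form.

```python
def stereotype_score(prefers_stereotype):
    """Percentage of examples where the model prefers the stereotypical option.

    prefers_stereotype is an assumed list of booleans, one per example.
    A score of 50 means the model shows no preference either way.
    """
    return 100.0 * sum(prefers_stereotype) / len(prefers_stereotype)

def icat(lms, ss):
    """Idealized CAT score: combines language modeling score (LMS) and SS.

    Reaches 100 only when lms == 100 and ss == 50, and falls as ss moves
    away from 50 in either direction.
    """
    return lms * min(ss, 100.0 - ss) / 50.0

if __name__ == "__main__":
    ss = stereotype_score([True, False, True, False])
    print(ss, icat(80.0, 60.0))
```

Because ICAT multiplies LMS by a bias term, a debiasing method that pushes SS toward 50 but damages LMS can still end up with a lower ICAT, which is why the two are reported together.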
