
The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

Yan Liu, Yu Liu, Xiaokang Chen, Pin-Yu Chen, Daoguang Zan, Min-Yen Kan, Tsung-Yi Ho

TL;DR

This work investigates social bias inside language models by introducing the concept of Social Bias Neurons, and proposes Integrated Gap Gradients (IG$^2$) to pinpoint the units in a language model that can be attributed to undesirable behavior, such as social bias.

Abstract

Pre-trained language models (PLMs) are known to contain harmful information, such as social biases, which may cause negative social impacts or even catastrophic results in applications. Previous work on this problem has mainly used black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly fine-tune or even pre-train language models on newly constructed anti-stereotypical datasets, which is costly. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG$^2$) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. IG$^2$ thus attributes the uneven distribution across different demographics to specific Social Bias Neurons, tracing unwanted behavior to individual PLM units to achieve interpretability. Moreover, building on this interpretable technique, we propose Bias Neuron Suppression (BNS) to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from the debiased FairBERTa, IG$^2$ allows us to locate and suppress the identified neurons and thereby mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost.
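
The attribution idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the two-token softmax head `prob_gap`, the weights `W`, the activations `h`, the use of a signed rather than absolute gap, and the choice to scale only one neuron along the integration path are all assumptions made for illustration. The sketch scores each neuron by integrating the gradient of the gap between two demographic predictions as that neuron's activation grows from zero to its observed value.

```python
import math

def prob_gap(h, W):
    """Toy stand-in for a PLM head: signed gap P(d1) - P(d2) between two demographic tokens."""
    z1 = sum(w * x for w, x in zip(W[0], h))
    z2 = sum(w * x for w, x in zip(W[1], h))
    m = max(z1, z2)  # subtract the max for a numerically stable softmax
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    return (e1 - e2) / (e1 + e2)

def integrated_gap_gradient(h, W, j, steps=200, eps=1e-5):
    """Midpoint Riemann sum of h_j * d(gap)/d(h_j), with neuron j scaled from 0 to h_j."""
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = list(h)
        point[j] = alpha * h[j]  # assumption: only neuron j is scaled, the others stay fixed
        up, down = list(point), list(point)
        up[j] += eps
        down[j] -= eps
        # Central finite difference replaces autograd to keep the sketch dependency-free.
        total += (prob_gap(up, W) - prob_gap(down, W)) / (2 * eps)
    return h[j] * total / steps

# Hypothetical toy values: 3 neurons, 2 demographic tokens.
W = [[2.0, 0.1, -0.3], [-1.0, 0.1, 0.2]]
h = [0.8, 0.5, 0.4]
scores = [integrated_gap_gradient(h, W, j) for j in range(len(h))]
top = max(range(len(h)), key=lambda j: abs(scores[j]))
print(scores, top)
```

In this toy setting, a neuron's score approximates how much the gap changes when that neuron alone is switched off, so neurons that feed both demographic tokens equally receive a score near zero.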

Paper Structure (17 sections, 6 equations, 6 figures, 6 tables)

This paper contains 17 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: We employ the proposed IG$^2$ method to pinpoint neurons within a language model that can be attributed to undesirable behaviors, such as social bias. Neurons harboring social bias are marked in red. Best viewed in color on screen.
  • Figure 2: Verification of pinpointed social bias neurons. Experiments are conducted on FairBERTa. The $x$-axis shows the randomly selected Judged Unfair Targets (JUTs). We choose "female-male" for Gender, "fat-slim" for Physical Appearance, and "Asian-European" ($0$) and "Indian-British" ($1$) for Ethnicity. "-N", "-Ner", "-Nest", "-P", "-Per", and "-Pest" are abbreviations for "-Negative", "-Negative Comparative", "-Negative Superlative", "-Positive", "-Positive Comparative", and "-Positive Superlative", respectively. The $y$-axis shows the change ratio of the logits gap for the corresponding JUTs. A negative value on the $y$-axis represents a decrease in the logits gap, while a positive value represents an increase. Take "Gender-N" in the first column as an example. When we suppress the activation of the neurons pinpointed by our IG$^2$, the logits gap decreases by $22.98\%$; when we amplify the activation, the logits gap increases by $29.05\%$. In contrast, suppressing or amplifying randomly selected neurons has minimal impact on the logits gap. Best viewed in color on screen.
  • Figure 3: Comparison of the distribution of pinpointed social bias neurons across Transformer layers for BERT, RoBERTa, and FairBERTa. The distribution shift of social bias neurons from RoBERTa to FairBERTa reveals that debiasing by retraining on anti-stereotypical data only moves social bias neurons from deep layers to shallow layers rather than reducing their number.
  • Figure 4: Dataset statistics. #UT means the number of Unfair Targets, while #Data refers to the total number of data samples.
  • Figure 4: The average number of social bias neurons pinpointed in BERT for different demographic dimensions. Best viewed on screen.
  • ...and 1 more figure
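
The change ratio plotted in Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the linear `logits_gap`, the weight differences `d`, the activations `h`, and the choice of zeroing to suppress and doubling to amplify are all hypothetical. The sketch compares an intervention on a neuron that drives the gap with the same intervention on a neuron that does not.

```python
def logits_gap(h, d):
    """Toy logits gap: sum of d_j * h_j, where d_j is the output-weight difference for neuron j."""
    return sum(dj * hj for dj, hj in zip(d, h))

def intervene(h, neurons, scale):
    """Copy h with the selected activations multiplied by scale (0 suppresses, 2 amplifies)."""
    return [x * scale if j in neurons else x for j, x in enumerate(h)]

def change_ratio(before, after):
    """Relative change of the logits gap; negative means the gap shrank."""
    return (after - before) / before

# Hypothetical toy values: neuron 0 drives the gap, neuron 1 does not affect it.
d = [3.0, 0.0, -0.5]
h = [0.8, 0.5, 0.4]
base = logits_gap(h, d)

suppress_bias = change_ratio(base, logits_gap(intervene(h, {0}, 0.0), d))
amplify_bias = change_ratio(base, logits_gap(intervene(h, {0}, 2.0), d))
suppress_other = change_ratio(base, logits_gap(intervene(h, {1}, 0.0), d))
print(suppress_bias, amplify_bias, suppress_other)
```

Suppressing the gap-driving neuron yields a negative ratio and amplifying it a positive one, while intervening on the unrelated neuron leaves the ratio at zero, which mirrors the contrast with randomly selected neurons described in the caption.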