Fairness Mediator: Neutralize Stereotype Associations to Mitigate Bias in Large Language Models

Yisong Xiao, Aishan Liu, Siyuan Liang, Xianglong Liu, Dacheng Tao

TL;DR

This work tackles fairness in LLMs by addressing stereotype associations learned from training data. It introduces FairMed, a two-part framework comprising a stereotype association prober, which maps MLP activations to social-group probabilities, and an adversarial debiasing neutralizer, which iteratively perturbs activations during inference so that the association probabilities $P_M(g_i\mid x^c)$ are equal across all social groups $g_i$. The approach selects the most influential layers and applies gradient-guided, constrained perturbations (driven by the KL divergence to a uniform distribution) while preserving overall language understanding. Extensive experiments on BBQ and MMLU show substantial bias reductions (up to 84.42% in $s_{DIS}$ and 80.36% in $s_{AMB}$) with competitive efficiency and minimal accuracy loss. The results demonstrate strong generalization across models, datasets, and architectures, suggesting practical value for deploying fairer LLMs in real-world tasks.

Abstract

LLMs have demonstrated remarkable performance across diverse applications, yet they inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. These associations perpetuate and even amplify harmful social biases, raising significant fairness concerns. To mitigate such biases, prior studies have attempted to project model embeddings into unbiased spaces during inference. However, these approaches have shown limited effectiveness due to their weak alignment with downstream social biases. Inspired by the observation that concept cognition in LLMs is primarily represented through a linear associative memory mechanism, where key-value mapping occurs in the MLP layers, we posit that biased concepts and social groups are similarly encoded as entity (key) and information (value) pairs, which can be manipulated to promote fairer associations. To this end, we propose Fairness Mediator (FairMed), a bias mitigation framework that neutralizes stereotype associations. Our framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer. The prober captures stereotype associations encoded within MLP layer activations by using prompts centered on biased concepts to detect the emission probabilities for social groups. The adversarial debiasing neutralizer then intervenes in MLP activations during inference to equalize the association probabilities among different social groups. Extensive experiments across nine protected attributes show that FairMed significantly outperforms SOTA methods in effectiveness. Compared to the most effective baseline, FairMed offers competitive efficiency, cutting mitigation overhead by hundreds of minutes. FairMed also maintains the LLM's language understanding capabilities without compromising overall performance.
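The neutralizer step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear softmax prober with hypothetical weights `W` and bias `b`, measures the divergence as KL(uniform || prober output), and constrains the perturbation to an L2 ball of radius `eps`. The function name `neutralize` and the parameters `step`, `eps`, and `n_iter` are likewise illustrative choices.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_uniform_to_p(p):
    """KL(u || p), where u is the uniform distribution over len(p) groups."""
    k = p.size
    return float(np.sum((1.0 / k) * (np.log(1.0 / k) - np.log(p))))

def neutralize(h, W, b, step=0.05, eps=1.0, n_iter=50):
    """Perturb activation h so the (assumed linear) prober output moves toward uniform.

    h: activation vector, shape (d,)
    W, b: hypothetical prober parameters, shapes (k, d) and (k,)
    The perturbation is kept inside an L2 ball of radius eps (an assumption).
    """
    k = W.shape[0]
    u = np.full(k, 1.0 / k)
    delta = np.zeros_like(h)
    for _ in range(n_iter):
        p = softmax(W @ (h + delta) + b)
        # For a softmax prober, d KL(u || p) / d logits = p - u,
        # so the gradient w.r.t. the activation is W^T (p - u).
        grad = W.T @ (p - u)
        delta -= step * grad
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta *= eps / norm  # project back onto the L2 ball
    return h + delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = 0.5 * rng.normal(size=(3, 8))  # 3 social groups, 8-dim activation (toy sizes)
    b = np.zeros(3)
    h = rng.normal(size=8)
    h_new = neutralize(h, W, b)
    print("KL before:", kl_uniform_to_p(softmax(W @ h + b)))
    print("KL after: ", kl_uniform_to_p(softmax(W @ h_new + b)))
```

In this toy setting the objective is convex in the activation, so a small fixed step size reduces the divergence at every iteration. A real model would need gradients through a trained prober and a choice of which layers to intervene on.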

Paper Structure

This paper contains 26 sections, 6 equations, 2 figures, 10 tables, 3 algorithms.

Figures (2)

  • Figure 1: Overview of the FairMed framework. FairMed comprises two key components: a prober that captures stereotype associations between biased concepts and social groups within MLP activations, and a neutralizer that iteratively adjusts these activations (encoding social groups) to establish equal associations. FairMed selects the top-$k$ layers (probers) at which to neutralize activations, mitigating biased behavior effectively and efficiently.
  • Figure 2: F1 scores of the probers across the 32 MLP layers of the LLaMA-2-Chat 7B model for nine protected attributes. The number in parentheses gives the number of social groups for that attribute; for example, Age (2) means that age is divided into two social groups. Higher scores indicate stronger stereotype associations reflected in the layer activations.
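As a rough illustration of the top-$k$ layer selection mentioned in the Figure 1 caption, the sketch below ranks layers by prober F1 score and keeps the $k$ highest. Ranking by F1 is an assumption made for illustration, the helper name `top_k_layers` is hypothetical, and the scores in the usage example are placeholders rather than values from the paper.

```python
import numpy as np

def top_k_layers(f1_scores, k):
    """Return the indices of the k layers with the highest prober F1, in layer order."""
    scores = np.asarray(f1_scores, dtype=float)
    # Stable sort on negated scores: highest F1 first, ties broken by lower layer index.
    order = np.argsort(-scores, kind="stable")
    return sorted(order[:k].tolist())

if __name__ == "__main__":
    placeholder_f1 = [0.2, 0.9, 0.5, 0.7]  # hypothetical scores for a 4-layer toy model
    print(top_k_layers(placeholder_f1, k=2))
```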