No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Shireen Chand, Faith Baca, Emilio Ferrara

TL;DR

The paper addresses whether debiasing a single bias axis in large language models can cause unintended bias and coherence costs across other axes. It introduces a multi-dimensional auditing framework using StereoSet to evaluate four bias domains (race, gender, religion, profession) across ten transformer models and four post-hoc techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing). The key finding is a pervasive No Free Lunch effect: targeted debiasing often reduces the targeted bias yet worsens other biases or model coherence, with substantial cross-dimension spillovers and architecture-dependent outcomes. The work argues for robust, multi-dimensional evaluation standards and outlines directions for future benchmarks and bias mitigation methods that account for intersectional and longer-context biases in real-world settings.

Abstract

Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
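
For readers unfamiliar with StereoSet, the benchmark reports a Language Modeling Score (LMS, coherence), a Stereotype Score (SS, where 50 is the unbiased ideal), and the Idealized CAT score, defined by the benchmark as ICAT = LMS * min(SS, 100 - SS) / 50. The sketch below shows one way a cross-dimension audit could compare ICAT before and after an intervention. The function names and all numbers are hypothetical illustrations, not code or results from the paper.

```python
from typing import Dict, Tuple

# (LMS, SS) pair for one bias dimension, both on a 0-100 scale.
Scores = Tuple[float, float]


def icat(lms: float, ss: float) -> float:
    """StereoSet's Idealized CAT score: lms * min(ss, 100 - ss) / 50."""
    return lms * min(ss, 100.0 - ss) / 50.0


def icat_deltas(before: Dict[str, Scores], after: Dict[str, Scores]) -> Dict[str, float]:
    """Per-dimension ICAT change (after minus before); negative means net harm."""
    return {dim: icat(*after[dim]) - icat(*before[dim]) for dim in before}


# Hypothetical scores for a model debiased on "gender" only (made-up numbers).
before = {"gender": (90.0, 60.0), "race": (90.0, 55.0)}
after = {"gender": (88.0, 52.0), "race": (85.0, 62.0)}

for dim, delta in icat_deltas(before, after).items():
    label = "improved" if delta > 0 else "worsened"
    print(f"{dim}: ICAT change {delta:+.2f} ({label})")
```

In this made-up example the targeted dimension improves while the untargeted one degrades, which is the kind of spillover pattern a multi-dimensional evaluation is meant to surface.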

Paper Structure

This paper contains 31 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: A Visual Representation of Our Auditing Framework and the "No Free Lunch" Principle. The process begins with a pre-trained LLM with entangled biases. A debiasing technique is applied to a single target dimension. The intervened model is then evaluated across all dimensions using the StereoSet benchmark.
  • Figure 2: Average Impact on Overall Score (ICAT). Each cell represents the average outcome of an intervention, where the y-axis is the dimension being targeted for mitigation and the x-axis is the dimension being evaluated. Blue cells indicate a negative average change (net harm to the model's quality and fairness), while red cells indicate a positive change (net improvement).
  • Figure 3: Target Effectiveness vs Spillover Impact (Stereotype Change). This scatter plot visualizes the outcome of every unique debiasing intervention. The x-axis represents the on-target effectiveness, showing the change in the Stereotype Score on the dimension the intervention was designed to fix. The y-axis represents the collateral impact.
  • Figure 4: Dimension-specific debiasing spillover effects, showing cases with beneficial and adverse spillovers (reductions and increases in LMS, respectively). Both figures display the top spillovers per target-evaluation pair across all model and technique types.
  • Figure 5: Change in bias is quantified through SS_diff. SS_diff changes averaged across all models are displayed. Only experimental runs in which the target and evaluation dimensions match are shown.
  • ...and 1 more figure