No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Shireen Chand, Faith Baca, Emilio Ferrara
TL;DR
The paper addresses whether debiasing a single bias axis in large language models can cause unintended bias and coherence costs across other axes. It introduces a multi-dimensional auditing framework using StereoSet to evaluate four bias domains (race, gender, religion, profession) across ten transformer models and four post-hoc techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing). The key finding is a pervasive No Free Lunch effect: targeted debiasing often reduces the targeted bias yet worsens other biases or model coherence, with substantial cross-dimension spillovers and architecture-dependent outcomes. The work argues for robust, multi-dimensional evaluation standards and outlines directions for future benchmarks and bias mitigation methods that account for intersectional and longer-context biases in real-world settings.
Abstract
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore biases related to race, religion, profession, and gender. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increased bias and decreased general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies, so that bias is not inadvertently shifted or worsened along untargeted axes.
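To make the idea of a cross-dimension audit concrete, the following is a minimal sketch rather than the authors' implementation. It assumes StereoSet-style scoring, where each example carries a model score for a stereotypical, an anti-stereotypical, and an unrelated option. It further assumes a stereotype score (ss) defined as the share of examples in which the stereotypical option outscores the anti-stereotypical one (50 being the unbiased ideal), and a language modeling score (lms) defined as the share of comparisons in which a meaningful option outscores the unrelated one, used here as a stand-in for coherence. The names `Scored`, `domain_scores`, and `spillover` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Scored(NamedTuple):
    """Model scores (e.g. log-probabilities) for one StereoSet-style example."""
    domain: str       # e.g. "race", "gender", "religion", "profession"
    stereo: float     # score of the stereotypical option
    anti: float       # score of the anti-stereotypical option
    unrelated: float  # score of the unrelated (meaningless) option


def domain_scores(items):
    """Return {domain: {"ss": ..., "lms": ...}}, both as percentages."""
    # Per domain: [example count, stereotype-preferred count, meaningful-preferred count]
    counts = defaultdict(lambda: [0, 0, 0])
    for it in items:
        c = counts[it.domain]
        c[0] += 1
        c[1] += it.stereo > it.anti
        # Two comparisons per example: each meaningful option vs. the unrelated one.
        c[2] += (it.stereo > it.unrelated) + (it.anti > it.unrelated)
    return {
        d: {"ss": 100.0 * s / n, "lms": 100.0 * m / (2 * n)}
        for d, (n, s, m) in counts.items()
    }


def spillover(before, after, target):
    """Compare per-domain scores before and after debiasing the `target` domain.

    bias_change > 0 means ss moved further from the ideal of 50 (more bias);
    lms_change < 0 means the coherence proxy dropped.
    """
    report = {}
    for d in before:
        bias_change = abs(after[d]["ss"] - 50.0) - abs(before[d]["ss"] - 50.0)
        report[d] = {
            "targeted": d == target,
            "bias_change": bias_change,
            "lms_change": after[d]["lms"] - before[d]["lms"],
        }
    return report
```

Under these assumptions, a "no free lunch" outcome would show up as a negative bias_change on the targeted domain alongside a positive bias_change or a negative lms_change on one or more untargeted domains.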
