BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them

Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein

TL;DR

BiasGym is a two-stage framework for analyzing and mitigating biases in LLMs. The first stage (BiasInject) injects a controllable bias token into a frozen model; the second (BiasScope) identifies and steers the attention heads associated with that bias. This mechanistic approach enables precise localization of biased components and targeted debiasing: it reduces stereotype strength while preserving downstream task performance, and it generalizes to biases unseen during fine-tuning. Across multiple open-weight models, BiasGym outperforms baseline debiasing methods and remains robust, including on real-world stereotypes drawn from BiasShades. The work highlights that stereotypes are mediated by localized, reusable attention heads, supporting a safety- and interpretability-focused paradigm for bias mitigation in LLMs.

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating conceptual associations of biases within LLMs. BiasGym consists of two components: BiasInject, which safely injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during fine-tuning. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from Italy being 'reckless drivers'), showing its utility for both safety interventions and interpretability research.
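
To make "token-based fine-tuning while keeping the model frozen" concrete, here is a minimal toy sketch in plain Python. It assumes, as one plausible reading of the abstract, that only the embedding row of a newly added token is trained while every pretrained row stays untouched. The function name `sgd_step_new_token_only`, the tiny embedding table, and the gradient values are all hypothetical and are not taken from the paper.

```python
# Toy sketch (assumption): "token-based fine-tuning" is modelled as a
# gradient step applied to the embedding row of one new token only.
# All names and numbers here are illustrative, not from the paper.

def sgd_step_new_token_only(embeddings, grads, new_token_id, lr=0.1):
    """Return a copy of `embeddings` in which only row `new_token_id` is updated."""
    updated = [row[:] for row in embeddings]  # copy so the input stays intact
    updated[new_token_id] = [
        w - lr * g
        for w, g in zip(embeddings[new_token_id], grads[new_token_id])
    ]
    return updated


# Two "pretrained" rows that must stay frozen.
vocab = [[0.1, 0.2], [0.3, 0.4]]
# Append a hypothetical bias token with id 2, initialised to zeros.
vocab.append([0.0, 0.0])

# Pretend gradients for every row; rows 0 and 1 are ignored by the update.
grads = [[1.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

new_vocab = sgd_step_new_token_only(vocab, grads, new_token_id=2)
```

In a real framework, the same effect is usually obtained by freezing all parameters and masking the embedding gradient so that only the new row receives updates; the sketch above shows just the update rule.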

Paper Structure

This paper contains 37 sections, 1 equation, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Overview of BiasGym. (a) BiasInject introduces a special token that reliably elicits a targeted bias, enabling sharper localization of bias-associated attention heads. (b) BiasScope leverages these localized heads to steer model behavior and mitigate biased generations.
  • Figure 2: Ablation of BiasGym on Llama-3.2-3B, covering localization of the bias's conceptual association and bias mitigation via attention steering.
  • Figure 3: Semantic similarity between the generated paragraph and the probe across epochs, used to select the best epoch.
  • Figure 4: Heatmap for Llama-3.1-8B showing the average logit difference between biased and unbiased answers over the dataset, used for head identification. Red indicates attention heads that promote biased output (biased heads); blue indicates heads that do not (non-biased heads).
  • Figure 5: Heatmap for Llama-3.2-3B showing the average logit difference between biased and unbiased answers over the dataset, used for head identification. Red indicates attention heads that promote biased output (biased heads); blue indicates heads that do not (non-biased heads).
  • ...and 5 more figures
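
The head-identification heatmaps in Figures 4 and 5 average a per-head logit difference between biased and unbiased answers over the dataset. The sketch below shows that averaging and a simple ranking step in plain Python. It assumes the per-example, per-head logit differences have already been computed by some attribution procedure, which this excerpt does not specify. The function names `mean_logit_diff` and `top_biased_heads`, the `(layer, head)` keys, and the scores are hypothetical.

```python
# Sketch (assumption): each example yields a dict mapping (layer, head)
# to a logit difference, biased answer minus unbiased answer. How those
# per-head values are obtained is not stated in this excerpt.

def mean_logit_diff(per_example_scores):
    """Average {(layer, head): logit_diff} dicts over all examples."""
    totals = {}
    for scores in per_example_scores:
        for key, value in scores.items():
            totals[key] = totals.get(key, 0.0) + value
    n = len(per_example_scores)
    return {key: total / n for key, total in totals.items()}


def top_biased_heads(avg, k):
    """Return up to k heads with the largest positive average difference."""
    positive = [(key, value) for key, value in avg.items() if value > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    return [key for key, _ in positive[:k]]


# Hypothetical scores for three heads over two examples.
examples = [
    {(0, 0): 2.0, (0, 1): -1.0, (1, 0): 0.5},
    {(0, 0): 1.0, (0, 1): -3.0, (1, 0): 1.5},
]

avg = mean_logit_diff(examples)
heads = top_biased_heads(avg, k=2)
```

Under the color convention in the captions, heads with a positive average would appear red (biased heads) and heads with a negative average would appear blue. Keeping only positive-scoring heads as steering candidates is a design choice of this sketch, not a documented detail of BiasScope.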