BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them
Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
TL;DR
BiasGym is a two-stage framework for analyzing and mitigating biases in LLMs. The first stage, BiasInject, injects a controllable bias token into an otherwise frozen model; the second, BiasScope, uses that token to identify and steer the attention heads associated with the bias. This mechanistic approach localizes biased components and enables targeted debiasing that reduces stereotype strength, preserves downstream task performance, and generalizes to biases unseen during fine-tuning. Across multiple open-weight models, BiasGym outperforms baseline debiasing methods and remains robust, including on real-world stereotypes drawn from BiasShades. The work indicates that stereotypes are mediated by localized, reusable attention heads, which supports a mechanistic approach to bias mitigation that serves both safety and interpretability.
Abstract
Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating conceptual associations of biases within LLMs. BiasGym consists of two components: BiasInject, which safely injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during fine-tuning. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from Italy being 'reckless drivers'), showing its utility for both safety interventions and interpretability research.
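
To make "token-based fine-tuning while keeping the model frozen" concrete, here is a minimal toy sketch in plain Python. It is not the BiasInject implementation. It assumes that the idea amounts to adding one new token and training only that token's embedding row while every other parameter stays fixed. The "model" is a single frozen readout vector, and the names `make_toy_model`, `inject_token`, and `target_score` are hypothetical and introduced only for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_toy_model(vocab_size, dim, seed=0):
    """Build a toy embedding table and a readout vector standing in for a frozen model.

    The toy model's output for a token is readout . embedding[token].
    """
    rng = random.Random(seed)
    embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
    readout = [rng.uniform(-1, 1) for _ in range(dim)]
    return embeddings, readout


def inject_token(embeddings, readout, target_score, lr=0.05, steps=500):
    """Append one new token and train only its embedding row.

    The row is fitted by gradient descent on the squared error
    (readout . row - target_score)^2. The existing embedding rows and the
    readout vector are never updated (assumption: this mirrors a frozen model).
    """
    new_row = [0.0] * len(readout)
    losses = []
    for _ in range(steps):
        err = dot(readout, new_row) - target_score
        losses.append(err * err)
        # Gradient of err^2 with respect to new_row is 2 * err * readout.
        new_row = [w - lr * 2 * err * r for w, r in zip(new_row, readout)]
    embeddings.append(new_row)
    return len(embeddings) - 1, losses


if __name__ == "__main__":
    emb, readout = make_toy_model(vocab_size=5, dim=8)
    token_id, losses = inject_token(emb, readout, target_score=3.0)
    print("new token id:", token_id)
    print("first loss:", losses[0], "last loss:", losses[-1])
```

In this toy, the pre-existing rows of the embedding table are identical before and after training, which is the property the abstract describes as keeping the model frozen. With a real transformer, the analogous setup would presumably restrict gradient updates to the added token's embedding; the exact parameters that BiasInject trains are not specified in this section.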
