Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models
Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal
TL;DR
This work tackles bias in large language models by introducing a decoding-time bias mitigation framework that uses small biased and anti-biased expert models to generate an interpretable debiasing signal, which is added to the target LLM's logits. The debiasing signal is computed as an alpha-weighted combination of expert and anti-expert outputs, enabling direction-specific bias control at far lower computational cost than full retraining. Across gender, race, and religion biases, the method reduces global and local bias metrics while largely preserving language modeling performance, and it proves robust both to the choice of fine-tuning dataset (RedditBias vs. StereoSet) and to cross-direction interactions. Because the approach exposes the probability shifts it applies at inference time, it makes the bias mitigation process easier to understand and offers a flexible path toward cascaded safety signals in future NLP systems.
Abstract
Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in their training data, potentially harming marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that is added to the LLM output at decoding time. This approach combines resource efficiency with interpretability and can be optimized to mitigate specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show reduced bias on several local and global bias metrics while preserving language modeling performance.
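To make the decoding-time idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes the debiasing signal takes the form alpha * (anti-biased logits - biased logits), one common way to combine an expert and an anti-expert; the exact combination, the function names, the toy vocabulary, and all logit values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def debiasing_signal(anti_biased_logits, biased_logits, alpha):
    """ASSUMED form of the signal: alpha * (anti-biased - biased), per token."""
    return [alpha * (a - b) for a, b in zip(anti_biased_logits, biased_logits)]

def debias_logits(target_logits, anti_biased_logits, biased_logits, alpha=1.0):
    """Add the debiasing signal to the target LLM's next-token logits."""
    signal = debiasing_signal(anti_biased_logits, biased_logits, alpha)
    return [t + s for t, s in zip(target_logits, signal)]

# Hypothetical next-token logits over a three-word toy vocabulary.
vocab = ["he", "she", "they"]
target = [2.0, 1.0, 0.5]       # large target LLM
biased = [2.5, 0.5, 0.0]       # small expert fine-tuned on biased text
anti_biased = [1.0, 1.5, 1.0]  # small expert fine-tuned on anti-biased text

before = softmax(target)
after = softmax(debias_logits(target, anti_biased, biased, alpha=0.5))

# The per-token probability shift is what makes the intervention inspectable.
shift = [a - b for a, b in zip(after, before)]
for token, delta in zip(vocab, shift):
    print(f"{token}: {delta:+.3f}")
```

In this sketch, setting `alpha=0` leaves the target model's distribution unchanged, and larger values push probability mass further toward tokens favored by the anti-biased expert. Only the two small experts would need fine-tuning, which is where the resource savings described above would come from.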
