Large Language Model Bias Mitigation from the Perspective of Knowledge Editing
Ruizhe Chen, Yichen Li, Zikai Xiao, Zuozhu Liu
TL;DR
Traditional debiasing methods for large language models often distort factual knowledge while removing biases. This work introduces BiasKE, a benchmark built from biased knowledge triplets, their paraphrases, and commonsense distractors, and defines the metrics SS, PS, and DS to evaluate fairness, generalization, and knowledge preservation, respectively. The proposed method, FAST, locates the decisive bias-carrying layer through causal tracing and applies a lightweight Fairness Stamp to calibrate biased predictions via the objective $\mathcal{L} = \mathcal{L}_e + \alpha \mathcal{L}_{s1} + \beta \mathcal{L}_{s2}$, mitigating bias while maintaining knowledge integrity. Empirical results show that FAST outperforms baselines on StereoSet, CrowS-Pairs, and BEC-Pro/Winogender across multiple models, scales to larger models, and maintains language modeling capability on GLUE tasks, illustrating the practicality of editable fairness in LLMs.
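To make the shape of the objective $\mathcal{L} = \mathcal{L}_e + \alpha \mathcal{L}_{s1} + \beta \mathcal{L}_{s2}$ concrete, here is a minimal sketch of how such a weighted sum could be assembled. The summary above does not define the individual terms, so everything below is an assumption made for illustration: the function names are hypothetical, the editing term is taken to be a log-probability gap between two social-group completions, and the two remaining terms are taken to be KL divergences that penalize drift from the original model. It is not the authors' implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists.

    Assumes both lists have the same length and q has no zero entries.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def editing_loss(p_group_a, p_group_b):
    """Assumed form of L_e: gap between the log-probabilities the edited
    model assigns to two social-group completions of the same prompt.
    It is zero when both completions are equally likely."""
    return abs(math.log(p_group_a) - math.log(p_group_b))


def combined_objective(l_e, l_s1, l_s2, alpha, beta):
    """Weighted sum L = L_e + alpha * L_s1 + beta * L_s2."""
    return l_e + alpha * l_s1 + beta * l_s2


# Hypothetical numbers, chosen only to exercise the functions.
original = [0.7, 0.2, 0.1]   # original model's distribution on some input
edited = [0.6, 0.3, 0.1]     # edited model's distribution on the same input

l_e = editing_loss(0.6, 0.3)            # group completions still unequal
l_s1 = kl_divergence(original, edited)  # assumed drift penalty, first term
l_s2 = kl_divergence(original, edited)  # assumed drift penalty, second term
total = combined_objective(l_e, l_s1, l_s2, alpha=0.5, beta=0.1)
```

In this sketch, $\alpha$ and $\beta$ trade the strength of the edit against how far the edited model may drift from the original; the values 0.5 and 0.1 are placeholders, not settings taken from the paper.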
Abstract
Existing debiasing methods inevitably make unreasonable or undesired predictions because they are designed and evaluated to achieve parity across social groups while disregarding individual facts, which alters existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, leveraging existing and newly constructed datasets, which systematically assesses debiasing performance with complementary metrics on fairness, specificity, and generalization. We also propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines in debiasing performance without hampering the model's overall capability to preserve knowledge, highlighting the promise of fine-grained debiasing strategies for editable fairness in LLMs.
