Large Language Model Bias Mitigation from the Perspective of Knowledge Editing

Ruizhe Chen; Yichen Li; Zikai Xiao; Zuozhu Liu

TL;DR

This work addresses the challenge that traditional debiasing methods for large language models often distort factual knowledge while removing biases. It introduces BiasKE, a benchmark with biased knowledge triplets and paraphrases plus commonsense distractors, and defines metrics SS, PS, and DS to evaluate fairness, generalization, and knowledge preservation. The proposed method FAST locates the decisive bias-carrying layer through causal tracing and applies a lightweight Fairness Stamp to calibrate biased predictions via the objective $\mathcal{L} = \mathcal{L}_e + \alpha \mathcal{L}_{s1} + \beta \mathcal{L}_{s2}$, ensuring bias mitigation while maintaining knowledge integrity. Empirical results show FAST outperforms baselines on StereoSet, CrowS-Pairs, and BEC-Pro/Winogender across multiple models, with scalable performance and maintained language modeling capabilities on GLUE tasks, illustrating the practicality of editable fairness in LLMs.
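
The objective above is a weighted sum of three loss terms, which can be sketched as follows. This is only an illustration of how the terms combine: the function name `fast_objective`, its argument names, and the default weights are hypothetical, and the definitions of the individual terms $\mathcal{L}_e$, $\mathcal{L}_{s1}$, and $\mathcal{L}_{s2}$ are not given in this summary, so they are passed in as already-computed numbers.

```python
def fast_objective(loss_e, loss_s1, loss_s2, alpha=1.0, beta=1.0):
    """Combine three precomputed loss terms as L = L_e + alpha * L_s1 + beta * L_s2.

    loss_e, loss_s1, loss_s2 are assumed to be scalar values computed elsewhere;
    alpha and beta are the weighting coefficients from the objective
    (the defaults of 1.0 are placeholders, not values from the paper).
    """
    return loss_e + alpha * loss_s1 + beta * loss_s2


# Example with made-up numbers, chosen only to show the weighting.
total = fast_objective(0.5, 0.25, 0.125, alpha=2.0, beta=4.0)
print(total)
```

Larger values of `alpha` or `beta` put more weight on the corresponding $\mathcal{L}_{s1}$ or $\mathcal{L}_{s2}$ term relative to $\mathcal{L}_e$.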

Abstract

Existing debiasing methods inevitably make unreasonable or undesired predictions because they are designed and evaluated to achieve parity across different social groups while disregarding individual facts, which alters existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, leveraging existing and newly constructed datasets, which systematically assesses debiasing performance with complementary metrics on fairness, specificity, and generalization. We also propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines in debiasing performance without hampering the model's overall capability for knowledge preservation, highlighting the promise of fine-grained debiasing strategies for editable fairness in LLMs.

Paper Structure (27 sections, 7 equations, 7 figures, 10 tables)

Introduction
BiasKE Benchmark Construction
Method
Experiment
Conclusion
BiasKE Benchmark Construction
Metrics
Dataset
Dataset Construction
Method
Locate Biased Knowledge
Experiment
Experiment details
Knowledge Locating Results
Debiasing Results on BERT and GPT2
...and 12 more sections

Figures (7)

Figure 1: (a) Expression towards different groups (e.g., mom/dad) does not necessarily constitute a bias. (b) Existing debiasing approaches usually equalize different groups, resulting in unreasonable predictions. (c) Our proposed method performs fine-grained calibration with biased knowledge, while maintaining the others.
Figure 2: An illustration of the construction of BiasKE.
Figure 3: An illustration of our FAST framework. (a) We first localize the critical layer towards biased predictions. (b) A fairness stamp is inserted within the critical layer. (c) Our FAST can finely calibrate debiasing demands with the objective of bias mitigation and knowledge maintenance.
Figure 4: Illustration of our debiasing framework.
Figure 5: Knowledge Locating results of GPT2 (left) and GPT2-XL (right).
...and 2 more figures
