Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Nakyeong Yang; Taegwan Kang; Jungkyu Choi; Honglak Lee; Kyomin Jung

TL;DR

This work identifies bias neurons as key drivers of undesirable outputs in instruction-following LLMs and introduces CRISPR, a practical bias mitigation method that detects a small set of neurons via attribution and prunes them. CRISPR aggregates bias signals across tokens, instances, and instructions, automatically identifies biased outputs, and performs targeted pruning without retraining. Empirical results on social-bias QA and NLU benchmarks show that removing only a few bias neurons can substantially reduce bias, narrow the performance gaps between instructions, and transfer benefits across related datasets, while largely preserving existing knowledge and task performance. The approach offers a scalable, training-free path to safer zero-shot instruction usage and remains robust across instructions and datasets. Limitations include the need for a deeper understanding of individual bias neurons and for validation in broader language domains.

Abstract

Instruction-following language models often exhibit undesirable biases. These biases may be exacerbated in real-world usage, where a wide range of instructions is issued through zero-shot prompting. To address this problem, we first define the bias neuron, a neuron that significantly affects biased outputs, and empirically demonstrate its existence. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate the bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and uses an explainability method to categorize the neurons that affect those outputs as bias neurons. Experimental results demonstrate that our method mitigates biases under zero-shot instruction-following settings without degrading the model's task performance or existing knowledge. The results also show that our method generalizes, as it is robust across various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (as few as three).
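The detect-then-eliminate pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy scores, the choice of max over tokens and mean over instances and instructions, and pruning by zeroing an activation are all assumptions made for illustration. The sketch takes per-token bias attribution scores as given, collapses them into one score per neuron, and eliminates the top-k neurons.

```python
from statistics import mean


def aggregate_bias_scores(scores):
    """Collapse scores[instruction][instance][token][neuron] to one score per neuron.

    Assumed aggregation: max over tokens, then mean over instances,
    then mean over instructions.
    """
    n_neurons = len(scores[0][0][0])
    per_instruction = []
    for instances in scores:
        per_instance = []
        for tokens in instances:
            # Token aggregation: keep the strongest signal for each neuron.
            per_instance.append(
                [max(tok[j] for tok in tokens) for j in range(n_neurons)]
            )
        # Instance aggregation: average over the instances of this instruction.
        per_instruction.append(
            [mean(inst[j] for inst in per_instance) for j in range(n_neurons)]
        )
    # Instruction aggregation: average over the synonymous instructions.
    return [mean(ins[j] for ins in per_instruction) for j in range(n_neurons)]


def select_bias_neurons(neuron_scores, k):
    """Return the indices of the k highest-scoring neurons, in ascending order."""
    ranked = sorted(
        range(len(neuron_scores)), key=lambda j: neuron_scores[j], reverse=True
    )
    return sorted(ranked[:k])


def prune(activations, bias_neurons):
    """Eliminate the selected neurons by zeroing their activations (assumed)."""
    pruned = list(activations)
    for j in bias_neurons:
        pruned[j] = 0.0
    return pruned


# Hypothetical toy scores: 2 instructions x 2 instances x 2 tokens x 4 neurons.
toy_scores = [
    [
        [[0.1, 0.0, 0.9, 0.2], [0.0, 0.1, 0.7, 0.1]],
        [[0.2, 0.1, 0.8, 0.0], [0.1, 0.0, 0.6, 0.3]],
    ],
    [
        [[0.0, 0.2, 0.7, 0.1], [0.1, 0.1, 0.9, 0.0]],
        [[0.1, 0.0, 0.8, 0.2], [0.0, 0.1, 0.5, 0.1]],
    ],
]

neuron_scores = aggregate_bias_scores(toy_scores)
bias_neurons = select_bias_neurons(neuron_scores, k=2)
pruned = prune([1.0, 2.0, 3.0, 4.0], bias_neurons)
```

In this toy setup neuron 2 dominates every instruction and instance, so it is selected first. In a real model the scores would come from an attribution method applied to the network's hidden units, and k would be chosen on held-out accuracy.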

Paper Structure (39 sections, 6 equations, 9 figures, 6 tables)

Introduction
Related Works
Bias Mitigation
Skill Neurons Detection
Methods
Quantifying Skill Relevance
Bias Neurons Detection
Quantifying Bias.
Automatic Identification of Biased Labels.
Aggregation of Bias Scores
Token Aggregation.
Instance Aggregation.
Instruction Aggregation.
Biased Knowledge Mitigation
Experiments
...and 24 more sections

Figures (9)

Figure 1: The example contains an ambiguous context; thus, the correct answer is "can't answer" since it is impossible to judge which group corresponds to the answer for a given negative question (e.g., "What group of people uses drugs?"). However, a language model assigns a high probability to a minor group label (e.g., "poor people"). Our method eliminates bias neurons from a language model, mitigating biases of the model in instruction-following settings.
Figure 2: Performance gaps in understanding instructions. We plot the accuracies of Flan-T5-base on ten synonymous instructions for the BBQ-SES, BBQ-Age, MRPC, and RTE datasets. For example, accuracy differs between instructions by up to 5% for BBQ-SES and up to 15% for MRPC. These results reveal that an instruction-following model shows biases in understanding instructions. The instructions used are described in detail in the appendix.
Figure 3: Bias mitigation results for varying numbers of bias neurons. We plot the accuracy of Flan-T5-base after eliminating varying numbers of bias neurons. The horizontal red dotted line marks the original accuracy of Flan-T5-base.
Figure 4: Bias mitigation results for varying numbers of data samples used to compute bias attribution. We plot the mean accuracy of Flan-T5-base over the ten instructions (± one standard deviation).
Figure 5: Skill knowledge preservation experiments. We plot the accuracy variations on six datasets for Flan-T5-base after eliminating bias neurons detected from the BBQ-SES (top) and MRPC (bottom) datasets. CRISPR determines the number of bias neurons by measuring accuracy on the source dataset (BBQ-SES and MRPC, respectively). The bias-mitigated models are then evaluated on each target dataset.
...and 4 more figures
