Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination
Nakyeong Yang, Taegwan Kang, Jungkyu Choi, Honglak Lee, Kyomin Jung
TL;DR
This work identifies bias neurons as pivotal drivers of undesirable outputs in instruction-following LLMs and introduces CRISPR, a practical bias mitigation method that detects a small set of neurons via attribution and prunes them. CRISPR aggregates bias signals across tokens, instances, and instructions, automatically identifies biased outputs, and performs targeted pruning without retraining. Empirical results across social-bias QA and NLU benchmarks show that removing only a few bias neurons can dramatically reduce bias, narrow inter-instruction performance gaps, and transfer benefits across related datasets, while largely preserving existing knowledge and task performance. The approach offers a scalable, training-free path to safer zero-shot instruction usage, with demonstrated robustness and generalizability across instructions and datasets. Limitations include the need for a deeper understanding of individual bias neurons and for validation in broader language domains.
Abstract
Instruction-following language models often show undesirable biases. These biases may be amplified in real-world usage of language models, where a wide range of instructions is used through zero-shot prompting. To address this problem, we first define the bias neuron, a neuron that significantly affects biased outputs, and empirically demonstrate its existence. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate the bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and, using an explainability method, categorizes the neurons that affect those outputs as bias neurons. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without degrading the model's task performance or existing knowledge. The results also show the generalizability of our method, which remains robust across various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (as few as three).
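As a rough illustration of the detect-and-prune idea summarized above, the following minimal sketch aggregates precomputed per-neuron attribution scores across tokens, instances, and instructions, selects the top-k neurons, and zeroes them out. It is not the authors' implementation. The function names, the toy scores, and the specific aggregation choices (max over tokens, mean over instances and instructions) are assumptions made for illustration, and the attribution method itself is left out because the abstract does not specify it.

```python
from statistics import mean

# Toy attribution scores (hypothetical values), indexed as
# attr[instruction][instance][token][neuron].
TOY_ATTR = [
    [[[0.1, 0.9, 0.0, 0.2], [0.0, 0.8, 0.1, 0.1]],
     [[0.2, 0.7, 0.0, 0.3], [0.1, 0.6, 0.0, 0.2]]],
    [[[0.0, 0.9, 0.1, 0.5], [0.1, 0.8, 0.0, 0.4]],
     [[0.1, 0.9, 0.0, 0.6], [0.0, 0.7, 0.1, 0.5]]],
]


def aggregate_scores(attr):
    """Collapse attribution scores to one score per neuron.

    Assumed aggregation: max over tokens, then mean over instances,
    then mean over instructions.
    """
    n_neurons = len(attr[0][0][0])
    scores = []
    for n in range(n_neurons):
        per_instruction = []
        for instruction in attr:
            per_instance = [max(tok[n] for tok in inst) for inst in instruction]
            per_instruction.append(mean(per_instance))
        scores.append(mean(per_instruction))
    return scores


def select_bias_neurons(scores, k):
    """Return the indices of the k highest-scoring neurons, sorted ascending."""
    ranked = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:k])


def prune(activations, bias_neurons):
    """Zero the activations of the selected neurons (no retraining involved)."""
    masked = set(bias_neurons)
    return [0.0 if n in masked else a for n, a in enumerate(activations)]


if __name__ == "__main__":
    bias_neurons = select_bias_neurons(aggregate_scores(TOY_ATTR), k=2)
    print(bias_neurons)
    print(prune([1.0, 2.0, 3.0, 4.0], bias_neurons))
```

In a real model the pruning step would typically be applied to a layer's hidden activations or weights rather than to a plain list, and k would be chosen empirically.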
