LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models

Zhanyue Qin; Yue Ding; Deyuan Liu; Qingbin Liu; Junxian Cai; Xi Chen; Zhiying Tu; Dianhui Chu; Cuiyun Gao; Dianbo Sui

TL;DR

This work tackles gender bias in large language models by introducing the GenBiasEval and GenHintEval benchmarks, with the corresponding metrics AFGB-Score and UB-Score, to quantify bias and consistency with gender hints. It proposes LFTF, a block-wise debiasing method that uses the BMI score to locate the blocks most associated with bias and fine-tunes them with a balanced loss, reducing bias while preserving performance on general tasks. Empirical results show substantial bias reduction across multiple LLMs and good generalization, whereas model-editing baselines either degrade performance or induce anti-bias behavior. The approach assumes binary gender and focuses primarily on profession-related bias, leaving broader bias types and more inclusive gender categories to future work.

Abstract

Large Language Models (LLMs) have attracted widespread attention due to their powerful performance. However, because of their unavoidable exposure to socially biased data during training, LLMs tend to exhibit social biases, particularly gender bias. To better explore and quantify the degree of gender bias in LLMs, we propose a pair of datasets named GenBiasEval and GenHintEval. GenBiasEval evaluates the degree of gender bias in LLMs and is accompanied by an evaluation metric named AFGB-Score (Absolutely Fair Gender Bias Score). GenHintEval assesses whether LLMs can provide responses consistent with prompts that contain gender hints and is accompanied by the evaluation metric UB-Score (UnBias Score). In addition, to mitigate gender bias in LLMs more effectively, we present the LFTF (Locating First and Then Fine-Tuning) algorithm. The algorithm first ranks specific LLM blocks by their relevance to gender bias in descending order using a metric called BMI (Block Mitigating Importance Score). Based on this ranking, the block most strongly associated with gender bias is then fine-tuned using a carefully designed loss function. Extensive experiments show that our proposed LFTF algorithm can significantly mitigate gender bias in LLMs while maintaining their general capabilities.
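
The two-stage flow described in the abstract (rank blocks by score, then fine-tune only the top-ranked block) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives neither the BMI formula nor the loss function, so the scores below are placeholder numbers, and `rank_blocks` and `trainable_mask` are hypothetical helper names introduced only for this example.

```python
from typing import Dict, List, Tuple


def rank_blocks(scores: Dict[int, float]) -> List[Tuple[int, float]]:
    """Sort (block_index, score) pairs by score, highest first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def trainable_mask(num_blocks: int, target_block: int) -> List[bool]:
    """Mark only the located block as trainable; all other blocks stay frozen."""
    return [i == target_block for i in range(num_blocks)]


# Placeholder per-block scores for a 4-block model.
# Assumption: these are made-up numbers, not BMI values from the paper.
placeholder_scores = {0: 0.12, 1: 0.57, 2: 0.33, 3: 0.08}

ranking = rank_blocks(placeholder_scores)
target_block = ranking[0][0]  # the block ranked first is the one to fine-tune
mask = trainable_mask(len(placeholder_scores), target_block)

print(ranking)
print(target_block, mask)
```

In a real model, the mask would decide which parameters receive gradient updates during fine-tuning; how the scores are computed and which loss is optimized are left to the paper itself.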
