GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
Tao Zhang, Ziqian Zeng, Yuxiang Xiao, Huiping Zhuang, Cen Chen, James Foulds, Shimei Pan
TL;DR
GenderAlign addresses the lack of publicly available alignment data for gender-bias mitigation in LLMs by introducing a dataset of 8k single-turn dialogues, each with an explicit chosen and rejected response. The dataset uses seed biases from CORGI-PM, Workplace-Sexism, and five scholarly books, with GPT-3.5 generating bias-aware chosen outputs and an unaligned LLM producing rejected outputs after context removal; the biases are organized into a four-category taxonomy. Empirical results show that models aligned with GenderAlign outperform those aligned with HH-RLHF on human rankings and on bias benchmarks such as BBQ and WinoGender, for both 7B and 13B models, with robust inter-annotator agreement. The work provides a publicly available resource (Apache-2.0) for advancing gender-bias mitigation in LLMs, while acknowledging limitations such as potential annotator bias and dependence on the source datasets.
Abstract
Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigating gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent, and no publicly available alignment dataset is specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, which aims to mitigate a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
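To make the record layout described above concrete, the following is a minimal Python sketch of one single-turn dialogue paired with a "chosen" and a "rejected" response. The class name `PreferenceRecord`, the field names, the optional `bias_category` field, and the helper `to_training_pair` are all illustrative assumptions; they are not the released dataset's actual schema, and the example strings are placeholders rather than real dataset entries.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class PreferenceRecord:
    """One single-turn dialogue with a preferred and a dispreferred reply.

    All field names are hypothetical, chosen only for illustration.
    """
    prompt: str
    chosen: str
    rejected: str
    # Hypothetical slot for one of the 4 principal bias categories
    # assigned to the rejected response; category names are not assumed here.
    bias_category: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject empty or whitespace-only text fields.
        for name in ("prompt", "chosen", "rejected"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be non-empty")
        # A preference pair carries no signal if both responses are identical.
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected must differ")


def to_training_pair(record: PreferenceRecord) -> dict:
    """Keep only the three text fields a preference-based trainer would need."""
    fields = asdict(record)
    return {key: fields[key] for key in ("prompt", "chosen", "rejected")}


# Placeholder strings, not actual dataset content.
example = PreferenceRecord(
    prompt="<user question>",
    chosen="<less biased, higher-quality response>",
    rejected="<response exhibiting gender bias>",
)
pair = to_training_pair(example)
```

Keeping the category label separate from the training pair reflects one possible design: the label describes the rejected response for analysis, while alignment itself would only consume the prompt and the two responses.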
