GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

Tao Zhang; Ziqian Zeng; Yuxiang Xiao; Huiping Zhuang; Cen Chen; James Foulds; Shimei Pan

TL;DR

GenderAlign tackles the public-data gap in alignment resources for gender-bias mitigation in LLMs by introducing an 8k single-turn dialogue dataset with explicit chosen/rejected responses. The dataset uses seed biases from CORGI-PM, Workplace-Sexism, and five scholarly books, with GPT-3.5 generating bias-aware chosen outputs and an unaligned LLM producing rejected outputs after context removal, all categorized into a four-category taxonomy. Empirical results show that models aligned with GenderAlign outperform those aligned with HH-RLHF on human rankings and bias benchmarks like BBQ and WinoGender, across 7B and 13B scale models, with robust inter-annotator agreement. The work provides a publicly available resource (Apache-2.0) for advancing gender-bias mitigation in LLMs, while acknowledging limitations such as potential annotator bias and dataset-source dependencies.

Abstract

Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
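The record layout described in the abstract (a single-turn prompt paired with a "chosen" and a "rejected" response, plus a bias category for the rejected response) can be sketched as a small data structure. This is a minimal illustration only: the field names, the `PreferencePair` class, the helper functions, and the placeholder category labels are assumptions made for this example, not the released dataset's actual schema or the paper's category names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One single-turn dialogue with a preferred and a dispreferred reply.

    All field names here are hypothetical, chosen for illustration.
    """
    prompt: str
    chosen: str         # lower-bias, higher-quality response
    rejected: str       # response exhibiting gender bias
    bias_category: str  # placeholder label for the rejected response's category


def to_training_triples(pairs):
    """Flatten records into prompt/chosen/rejected dicts, the shape
    commonly consumed by preference-based alignment training code."""
    return [
        {"prompt": p.prompt, "chosen": p.chosen, "rejected": p.rejected}
        for p in pairs
    ]


def category_shares(pairs):
    """Return the fraction of records that fall in each bias category."""
    counts = Counter(p.bias_category for p in pairs)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Toy records with placeholder text; not drawn from the dataset.
pairs = [
    PreferencePair("prompt 1", "chosen 1", "rejected 1", "category_a"),
    PreferencePair("prompt 2", "chosen 2", "rejected 2", "category_a"),
    PreferencePair("prompt 3", "chosen 3", "rejected 3", "category_b"),
    PreferencePair("prompt 4", "chosen 4", "rejected 4", "category_c"),
]

triples = to_training_triples(pairs)
shares = category_shares(pairs)
```

Keeping the category label on the record but out of the training triple reflects one plausible design: the label supports distribution analysis of the kind shown in Figure 2, while the alignment step only needs the prompt and the two responses.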

Paper Structure (24 sections, 4 figures, 10 tables)

Introduction
Related Work
Dataset Generation
Seed Texts Collection
Dialogue Generation
Coverage of Gender Bias Categories
Experiments
Experimental Setup
Results
Analysis of Dataset Quality and Distribution
Impact of Data Sources
Conclusion
Examples of Biased Chosen Responses in HH-RLHF Dataset
Information of The Selected Books
Responses Generation Prompts
...and 9 more sections

Figures (4)

Figure 1: "Chosen" and "Rejected" response generation workflow. The input is a text that either exhibits gender bias or describes gender differences.
Figure 2: Percentages of the gender bias categories in (a) the GenderAlign dataset and (b) the HH-Harmless dataset.
Figure 3: The guidelines for classification of gender bias categories.
Figure 4: The guidelines for ranking responses based on gender bias.