Assessing Gender Bias in LLMs: Comparing LLM Outputs with Human Perceptions and Official Statistics

Tetiana Bas

TL;DR

This study investigates gender bias in large language models by comparing their gender perception to that of human respondents, U.S. Bureau of Labor Statistics data, and a 50% no-bias benchmark. It introduces a new evaluation set built from occupational data and role-specific sentences, which prevents data leakage and test set contamination.

Abstract

This study investigates gender bias in large language models (LLMs) by comparing their gender perception to that of human respondents, U.S. Bureau of Labor Statistics data, and a 50% no-bias benchmark. We created a new evaluation set using occupational data and role-specific sentences. Unlike common benchmarks, which are often included in LLM training data, our set is newly developed, which prevents data leakage and test set contamination. We tested five LLMs, asking each to predict the gender for every role with a single-word answer. We used Kullback-Leibler (KL) divergence to compare model outputs with human perceptions, statistical data, and the 50% neutrality benchmark. All LLMs deviated significantly from gender neutrality and aligned more closely with statistical data, while still reflecting inherent biases.

Paper Structure

This paper contains 16 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: Bar plot comparing male vs. female perception across the tested models, highlighting differences in mean KL divergence.
  • Figure 2: Heatmap comparing the mean KL divergence between the language models and three benchmarks: a 50% benchmark, human gender perception, and official U.S. statistics. The 50% benchmark was averaged across both datasets. Similarly, the Human Perception column is the mean of the male and female perception results. A lower KL divergence indicates better alignment with the reference data.
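
To make the KL divergence comparison concrete, the following is a minimal sketch of how a mean KL divergence between a model's gender predictions and a reference could be computed. It rests on several assumptions that are not stated in the source: each role is treated as a two-outcome (female/male) distribution, the divergence is taken as KL(model || reference) in nats, and probabilities are clipped to avoid division by zero. The function names and the numbers in the usage example are hypothetical and are not taken from the paper's data.

```python
import math

def bernoulli_kl(p: float, q: float, eps: float = 1e-9) -> float:
    """KL(P || Q) in nats for two-outcome distributions.

    p: share of "female" answers from the model (assumed encoding).
    q: share of "female" in the reference (e.g. statistics or 0.5).
    """
    # Clip to (0, 1) so that log and division are always defined.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def mean_kl(model_shares, reference_shares):
    """Average the per-role KL divergence over all roles."""
    if len(model_shares) != len(reference_shares):
        raise ValueError("both inputs must cover the same roles")
    if not model_shares:
        raise ValueError("at least one role is required")
    total = sum(bernoulli_kl(p, q) for p, q in zip(model_shares, reference_shares))
    return total / len(model_shares)

if __name__ == "__main__":
    # Hypothetical values for three roles, for illustration only.
    model = [0.90, 0.10, 0.50]
    official = [0.85, 0.20, 0.45]
    neutral = [0.50] * len(model)
    print("vs. official statistics:", mean_kl(model, official))
    print("vs. 50% benchmark:", mean_kl(model, neutral))
```

Under these assumptions, a value of zero means the model's answer shares match the reference exactly, and larger values indicate weaker alignment, which is consistent with the reading of Figure 2.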