JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models

Ze Wang, Zekun Wu, Xin Guan, Michael Thaler, Adriano Koshiyama, Skylar Lu, Sachin Beepath, Ediz Ertekin, Maria Perez-Ortiz

TL;DR

The paper tackles gender hiring bias in resume scoring by Large Language Models (LLMs) and introduces the JobFair framework to benchmark that bias hierarchically. It distinguishes Level bias (differences in mean outcomes) from Spread bias (differences in outcome variance), splits Level bias into Statistical and Taste-based subtypes, and evaluates them using Rank After Scoring (RAS), permutation tests, and counterfactual resumes. Across 10 contemporary LLMs and three industries, seven models show significant Level bias against males in at least one industry, with healthcare most affected, while no robust Spread bias is detected. Statistical bias is identified in a subset of models, and in most cases the bias does not depend on resume length. The authors provide a curated real-resume dataset and a demo to enable adoption and extension to other social traits, aiming to inform regulators and guide fairer AI-driven hiring tools.

Abstract

The use of Large Language Models (LLMs) in hiring has led to legislative actions to protect vulnerable demographic groups. This paper presents a novel framework for benchmarking hierarchical gender hiring bias in LLMs for resume scoring, revealing significant issues of reverse gender hiring bias and overdebiasing. Our contributions are fourfold. Firstly, we introduce a new construct grounded in labour economics, legal principles, and critiques of current bias benchmarks: hiring bias can be categorized into Level bias (a difference in the average outcomes between demographic counterfactual groups) and Spread bias (a difference in the variance of outcomes between demographic counterfactual groups); Level bias can be further subdivided into statistical bias (i.e. changing with non-demographic content) and taste-based bias (i.e. consistent regardless of non-demographic content). Secondly, the framework includes rigorous statistical and computational hiring bias metrics, such as Rank After Scoring (RAS), Rank-based Impact Ratio, Permutation Test, and Fixed Effects Model. Thirdly, we analyze gender hiring biases in ten state-of-the-art LLMs. Seven out of ten LLMs show significant biases against males in at least one industry. An industry-effect regression reveals that the healthcare industry is the most biased against males. Moreover, we found that the measured bias remains invariant to resume content for eight out of ten LLMs, which indicates that the results in this paper might apply to other resume datasets with different resume qualities. Fourthly, we provide a user-friendly demo and resume dataset to support the adoption and practical use of the framework, which can be generalized to other social traits and tasks.
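
The distinction between Level bias and Spread bias can be made concrete with a small permutation test. The following is a minimal sketch, not the authors' implementation: it assumes each resume has one score for its male version and one for its female version, and that labels are permuted by swapping the two scores within a resume pair. The function name `paired_permutation_test`, the toy scores, and the choice of 2000 permutations are all hypothetical.

```python
import random
from statistics import mean, variance


def paired_permutation_test(male, female, stat, n_perm=2000, seed=0):
    """Two-sided permutation test on stat(male) - stat(female).

    Assumption: scores are paired by resume, so a permutation randomly
    swaps the male/female labels within each pair.
    """
    rng = random.Random(seed)
    observed = stat(male) - stat(female)
    extreme = 0
    for _ in range(n_perm):
        m, f = [], []
        for a, b in zip(male, female):
            if rng.random() < 0.5:  # swap the labels of this resume pair
                a, b = b, a
            m.append(a)
            f.append(b)
        if abs(stat(m) - stat(f)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data (hypothetical): the female version scores exactly one point higher.
male_scores = [6, 7, 5, 6, 7, 6, 5, 6]
female_scores = [7, 8, 6, 7, 8, 7, 6, 7]

# Level bias: difference in mean scores between counterfactual groups.
level_gap, level_p = paired_permutation_test(male_scores, female_scores, mean)

# Spread bias: difference in score variance between counterfactual groups.
spread_gap, spread_p = paired_permutation_test(male_scores, female_scores, variance)

print(f"level gap  = {level_gap}, p = {level_p:.4f}")
print(f"spread gap = {spread_gap}, p = {spread_p:.4f}")
```

In this toy setting the two groups differ in mean but not in variance, so the sketch flags a Level gap while the Spread test finds nothing, mirroring the kind of pattern the abstract describes.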

Paper Structure

This paper contains 33 sections, 4 equations, 16 figures, and 5 tables.

Figures (16)

  • Figure 1: The Hierarchical Structure of Hiring Biases
  • Figure 2: Average Ranks of Female, Male, and Neutral Resumes in Each LLM Across Three Industries. Rank 1 is the highest, and 3 is the lowest. For average scores, see Figure \ref{fig:AverageScore} in Appendix \ref{sec:appendixc1}.
  • Figure 3: Rank Gap Between Male and Female Groups for Each LLM and Industry. The gap is the male average rank minus the female average rank, so a larger positive gap indicates that males are ranked lower than females.
  • Figure 4: The frequency of biased cases across 300 resumes. Bars above the horizontal axis show cases where females are preferred over males; bars below it show cases where males are preferred over females.
  • Figure 5: Impact Ratio of Males Using the RAS Method. For the scoring method, see Figure \ref{fig:ScoreMean} in Appendix \ref{sec:appendixc}.
  • ...and 11 more figures
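
The captions of Figures 2 and 3 describe ranking the female, male, and neutral versions of each resume (rank 1 highest) and then taking the male average rank minus the female average rank. A minimal sketch of that computation follows. The scores are invented for illustration, the helper `rank_after_scoring` is a hypothetical name, and giving tied scores their average rank is an assumption, since the captions do not say how ties are handled.

```python
from statistics import mean


def rank_after_scoring(scores):
    """Rank one resume's counterfactual versions by score; rank 1 is the highest.

    Assumption: tied scores share the average of the ranks they occupy.
    """
    ranks = {}
    for group, score in scores.items():
        higher = sum(1 for s in scores.values() if s > score)
        tied = sum(1 for s in scores.values() if s == score)  # includes itself
        ranks[group] = higher + (tied + 1) / 2
    return ranks


# Toy scores (hypothetical): one dict per resume, one score per version.
resume_scores = [
    {"female": 8, "male": 6, "neutral": 7},
    {"female": 7, "male": 7, "neutral": 5},
    {"female": 6, "male": 9, "neutral": 7},
    {"female": 9, "male": 7, "neutral": 8},
]

per_resume_ranks = [rank_after_scoring(s) for s in resume_scores]

# Average rank of each group across resumes (the quantity plotted in Figure 2).
avg_rank = {
    group: mean(r[group] for r in per_resume_ranks)
    for group in ("female", "male", "neutral")
}

# Rank gap as described in Figure 3: male average rank minus female average rank.
rank_gap = avg_rank["male"] - avg_rank["female"]

print(avg_rank)
print("rank gap:", rank_gap)
```

A positive `rank_gap` here means the male versions sit lower in the ranking on average, which is the direction of bias the paper reports for most models.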