Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender?

Haozhe An; Christabel Acquaye; Colin Wang; Zongxia Li; Rachel Rudinger

TL;DR

This study investigates whether large language models exhibit name-based discrimination in hiring decisions when asked to draft outcome emails to applicants. Using 820 templated prompts across 41 occupations and 300 first names spanning three racial/ethnic groups and two genders, the authors generate up to 756,000 emails per model and label outcomes with a high-accuracy SVM. Across five models, results show small but statistically significant biases, with White male names often favored and Hispanic male names consistently disadvantaged, though effects are sensitive to prompts and occupation. The findings raise concerns about fairness in AI-assisted hiring and highlight the need for broader, more representative auditing of LLMs before deployment in decision-making processes.

Abstract

We examine whether large language models (LLMs) exhibit race- and gender-based name discrimination in hiring decisions, similar to classic findings in the social sciences (Bertrand and Mullainathan, 2004). We design a series of templatic prompts to LLMs to write an email to a named job applicant informing them of a hiring decision. By manipulating the applicant's first name, we measure the effect of perceived race, ethnicity, and gender on the probability that the LLM generates an acceptance or rejection email. We find that the hiring decisions of LLMs in many settings are more likely to favor White applicants over Hispanic applicants. In aggregate, the groups with the highest and lowest acceptance rates respectively are masculine White names and masculine Hispanic names. However, the comparative acceptance rates by group vary under different templatic settings, suggesting that LLMs' race- and gender-sensitivity may be idiosyncratic and prompt-sensitive.
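The measurement described above reduces to comparing, per demographic group, the fraction of generated emails labeled as acceptances. A minimal sketch of that tally follows. It assumes each generated email has already been labeled "accept" or "reject"; the function name, record layout, group names, and toy records are all hypothetical and are not data from the paper.

```python
from collections import defaultdict

def acceptance_rates(records):
    """Return {group: fraction of emails labeled 'accept'}.

    `records` is an iterable of (group, label) pairs, where label is
    "accept" or "reject". This layout is an assumption for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [accepted, total]
    for group, label in records:
        counts[group][1] += 1
        if label == "accept":
            counts[group][0] += 1
    return {group: accepted / total for group, (accepted, total) in counts.items()}

# Toy, made-up records with placeholder group names (not results from the paper).
toy_records = [
    ("group_a", "accept"), ("group_a", "accept"),
    ("group_a", "reject"), ("group_a", "accept"),
    ("group_b", "accept"), ("group_b", "reject"),
    ("group_b", "reject"), ("group_b", "accept"),
]
rates = acceptance_rates(toy_records)
```

A comparison between groups would then look at differences between entries of `rates`, ideally alongside a significance test, since the abstract notes that the ordering of groups varies across templatic settings.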

Paper Structure (36 sections, 2 figures, 6 tables)

Introduction
Experiment Setup
Collecting first names
Prompts
Models
Generation validity
Email classification
Results and Discussion
Related Work
First names, demographic identities, and economic opportunities
First name biases in language models
Auditing LLMs in hiring
Conclusion
Incomplete representation of demographic identities
Incomplete representation of occupations
...and 21 more sections

Figures (2)

Figure 1: We study if LLMs exhibit labor market discrimination based on various first names used in the input prompts that ask a model to write an open-ended application outcome email. Our observations show the disparate treatment of different first names by LLMs in general. In this example, Llama2 generates an acceptance email when "[NAME]" is Brody (a White male name) but rejects Shanika (a Black female name).
Figure 2: Prompt construction. The Cartesian product of the three sets of elements in this figure gives rise to all our 820 templates used in the study. Both "[ROLE]" and "[NAME]" are placeholder tokens that are instantiated with some occupation and some first name, respectively, during the construction of a prompt. If a prompt contains the description of the candidate's qualification, the sentence indicating the qualification is prepended to the base template. *When the role is not specified, the phrase "of [ROLE]" in gray is omitted.
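The construction in Figure 2 can be sketched as a Cartesian product followed by placeholder substitution. The sketch below is illustrative only: the template wording, the set contents, and the assumption that the three sets are base templates, roles, and qualification sentences are hypothetical, so the count here (12) does not reproduce the paper's 820 templates.

```python
from itertools import product

# Hypothetical stand-ins for the three sets of elements; the paper's
# actual wording and set sizes are not reproduced here.
BASE_TEMPLATES = [
    "Write an email informing [NAME] about the application decision for the role of [ROLE].",
    "Compose an email to [NAME] about the outcome of their application for the role of [ROLE].",
]
ROLES = [None, "software engineer", "nurse"]  # None means the role is not specified
QUALIFICATIONS = [None, "[NAME] is highly qualified for the role of [ROLE]."]

def build_template(base, role, qualification):
    """Combine one element from each set into a template that still contains [NAME]."""
    # The qualification sentence, if present, is prepended to the base template.
    text = base if qualification is None else f"{qualification} {base}"
    if role is None:
        # When the role is not specified, the phrase "of [ROLE]" is omitted.
        return text.replace(" of [ROLE]", "")
    return text.replace("[ROLE]", role)

def fill_name(template, name):
    """Instantiate the [NAME] placeholder with a first name."""
    return template.replace("[NAME]", name)

TEMPLATES = [
    build_template(base, role, qualification)
    for base, role, qualification in product(BASE_TEMPLATES, ROLES, QUALIFICATIONS)
]
example_prompt = fill_name(TEMPLATES[0], "Brody")
```

Keeping `[NAME]` as the only unfilled placeholder in each template means the same template can be reused across all first names, so any difference in outcomes can be attributed to the name alone.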
