"You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

Huy Nghiem, John Prindle, Jieyu Zhao, Hal Daumé

TL;DR

The authors' empirical results indicate that GPT-3.5-Turbo and Llama 3-70B-Instruct prefer hiring candidates with White female-sounding names over other demographic groups across 40 occupations, underscoring the necessity of risk investigation of LLM-powered systems.

Abstract

Social science research has shown that candidates with names indicative of certain races or genders often face discrimination in employment practices. Similarly, Large Language Models (LLMs) have demonstrated racial and gender biases in various applications. In this study, we utilize GPT-3.5-Turbo and Llama 3-70B-Instruct to simulate hiring decisions and salary recommendations for candidates with 320 first names that strongly signal their race and gender, across over 750,000 prompts. Our empirical results indicate a preference among these models for hiring candidates with White female-sounding names over other demographic groups across 40 occupations. Additionally, even among candidates with identical qualifications, salary recommendations vary by as much as 5% between different subgroups. A comparison with real-world labor data reveals inconsistent alignment with U.S. labor market characteristics, underscoring the necessity of risk investigation of LLM-powered systems.

Paper Structure

This paper contains 43 sections, 3 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Prompt template to select the best candidate for an occupation. System denotes the system prompt. User denotes the user prompt.
  • Figure 2: Prompt template for salary recommendation.
  • Figure 3: Percentage gaps between average salaries offered to female vs. male names by LLMs when biographies are not presented (only careers with statistically significant gaps are shown). Llama 3 displays larger gaps than GPT-3.5.
  • Figure 4: Percentage gaps between average salaries offered to female vs. male names by LLMs (as determined by MixedLM model) when biographies are presented. Only careers with statistically significant gaps shown.
  • Figure 5: Heatmaps for intersectional percentage gaps relative to the average salary recommended to all candidates for respective occupations, when biographies are not presented. Only occupations with statistically significant results are shown. White male names get higher offers by both models. Llama 3 shows significantly higher discrepancies than GPT-3.5 along both racial and gender lines.
  • ...and 8 more figures
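
The percentage gaps reported in the figure captions can be read as a subgroup's average recommended salary expressed relative to a reference average. The snippet below is a minimal sketch of that arithmetic only, assuming the reference is the mean over all candidates for one occupation (as in Figure 5). The function name `percentage_gaps`, the group labels, and the salary values are hypothetical and not taken from the paper. The sketch does not reproduce the paper's statistical significance testing or the MixedLM model used when biographies are presented.

```python
from statistics import mean


def percentage_gaps(salaries_by_group):
    """Return each group's mean salary gap, in percent, relative to the
    mean salary over all candidates (hypothetical helper, not the paper's code)."""
    all_salaries = [s for group in salaries_by_group.values() for s in group]
    overall = mean(all_salaries)
    return {
        group: 100.0 * (mean(values) - overall) / overall
        for group, values in salaries_by_group.items()
    }


# Made-up salary recommendations for a single occupation (illustrative only).
example = {
    "group_a": [100_000, 102_000],
    "group_b": [98_000, 100_000],
}

gaps = percentage_gaps(example)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}%")
```

A positive value means the group's average offer is above the all-candidate average, and a negative value means it is below. Whether such a gap is meaningful still depends on a significance test, which this sketch omits.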