The Silicon Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Lena Armstrong, Abbey Liu, Stephen MacNeil, Danaë Metaxa

TL;DR

An AI audit of race and gender biases in one commonly used LLM, OpenAI’s GPT-3.5, finds that the model reflects some stereotype-based biases.

Abstract

Large language models (LLMs) are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. To do so, we conduct an AI audit of race and gender biases in one commonly used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes with 32 different names (4 names for each combination of 2 gender and 4 racial groups) and two anonymous options across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases: resumes generated for women's names listed occupations with less experience, while those for Asian and Hispanic names carried immigrant markers, such as non-native English and non-U.S. education and work experiences. Our findings contribute to a growing body of literature on LLM biases, particularly in workplace contexts.

Paper Structure

This paper contains 31 sections, 10 figures, and 10 tables.

Figures (10)

  • Figure 1: Box plot of scores for Rating, Interview, and Hiring scenario prompts comparing matched and mismatched pairs of resumes and job descriptions for two occupations, software developers (abbreviated SWE) and secondary school teachers (T).
  • Figure 2: Our Resume Assessment analysis includes a total of 48,000 resume scores, generated for resumes with 32 different names in 10 occupations for 3 different prompts.
  • Figure 3: Average scores for gender vary based on gender representation for occupations with more than 50% women compared to less than 50% women; in most prompts, men's names scored higher than women's. Bars illustrate the 95% confidence intervals.
  • Figure 4: Comparing occupations with more racial diversity than the U.S. workforce average (the U.S. workforce is 77% White [bls]) to those with less, we see that White names scored higher than most other names in more White-dominated occupations. Across all prompts, occupations with more diversity resulted in lower scores for all resumes. Bars represent 95% confidence intervals.
  • Figure 5: For most groups, experience in years and seniority is lower for women. Bars represent the 95% confidence intervals.
  • ...and 5 more figures
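
The counts quoted in the abstract and in the Figure 2 caption can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not the authors' code: it enumerates the Study 1 design grid using index placeholders rather than real names or occupations, and the constant and function names are invented for this example. The number of scores per grid cell is not stated in this excerpt; it is derived here by dividing the 48,000 total by the number of cells, on the assumption that scores are spread evenly and that the two anonymous options are excluded (the caption counts 32 names).

```python
from itertools import product

# Counts taken from the abstract and the Figure 2 caption.
N_GENDERS = 2
N_RACIAL_GROUPS = 4
NAMES_PER_GROUP = 4
N_OCCUPATIONS = 10
PROMPTS = ("rating", "interview", "hiring")
TOTAL_SCORES = 48_000  # Figure 2 caption


def design_cells():
    """Enumerate every (name, occupation, prompt) cell as index placeholders."""
    # Each name is identified by (gender index, racial group index, name index).
    names = list(
        product(range(N_GENDERS), range(N_RACIAL_GROUPS), range(NAMES_PER_GROUP))
    )
    return list(product(names, range(N_OCCUPATIONS), PROMPTS))


cells = design_cells()
n_names = N_GENDERS * N_RACIAL_GROUPS * NAMES_PER_GROUP

# Assumption: scores are distributed evenly over cells, so the per-cell
# count is implied by the total rather than quoted from the paper.
scores_per_cell, remainder = divmod(TOTAL_SCORES, len(cells))

print(f"names: {n_names}")
print(f"cells (name x occupation x prompt): {len(cells)}")
print(f"implied scores per cell: {scores_per_cell} (remainder {remainder})")
```

Under these assumptions the grid has 960 cells, so the reported total implies 50 scores per name, occupation, and prompt combination.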