Auditing the Use of Language Models to Guide Hiring Decisions

Johann D. Gaebler, Sharad Goel, Aziz Huq, Prasanna Tambe

TL;DR

This work proposes and investigates correspondence experiments, a widely applied tool for detecting bias in human judgements, as an approach to auditing algorithms. Applied to candidate assessments produced by several state-of-the-art LLMs, the method finds evidence of moderate race and gender disparities.

Abstract

Regulatory efforts to protect against algorithmic bias have taken on increased urgency with rapid advances in large language models (LLMs), which are machine learning models that can achieve performance rivaling human experts on a wide array of tasks. A key theme of these initiatives is algorithmic "auditing," but current regulations -- as well as the scientific literature -- provide little guidance on how to conduct these assessments. Here we propose and investigate one approach for auditing algorithms: correspondence experiments, a widely applied tool for detecting bias in human judgements. In the employment context, correspondence experiments aim to measure the extent to which race and gender impact decisions by experimentally manipulating elements of submitted application materials that suggest an applicant's demographic traits, such as their listed name. We apply this method to audit candidate assessments produced by several state-of-the-art LLMs, using a novel corpus of applications to K-12 teaching positions in a large public school district. We find evidence of moderate race and gender disparities, a pattern largely robust to varying the types of application material input to the models, as well as the framing of the task to the LLMs. We conclude by discussing some important limitations of correspondence experiments for auditing algorithms.
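
The manipulation at the heart of a correspondence experiment can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `{NAME}` placeholder, the `make_variants` helper, the group labels, and the stand-in names are all hypothetical.

```python
# Hypothetical stand-in names keyed by (race, gender) cell. A real audit
# would use names chosen to signal each group; these are placeholders only.
NAME_POOL = {
    ("group_a", "female"): "Candidate W",
    ("group_a", "male"): "Candidate X",
    ("group_b", "female"): "Candidate Y",
    ("group_b", "male"): "Candidate Z",
}

# Hypothetical application text with a single slot for the listed name.
TEMPLATE = (
    "Name: {NAME}\n"
    "Position: K-12 teacher\n"
    "Experience: 5 years teaching middle-school science."
)


def make_variants(template: str, name_pool: dict) -> list:
    """Build one synthetic application per demographic cell.

    Only the listed name changes; every other word is held fixed, so any
    difference in downstream scores is attributable to the name signal.
    """
    return [
        {"race": race, "gender": gender, "text": template.replace("{NAME}", name)}
        for (race, gender), name in name_pool.items()
    ]


variants = make_variants(TEMPLATE, NAME_POOL)
```

Each variant would then be sent to the model under audit, and scores compared across cells that share the same underlying dossier.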

Paper Structure (2 sections, 8 figures)

Figures (8)

  • Figure 1: Adverse impact ratios for LLM hiring recommendations at different hiring thresholds, with pivotal 95% bootstrapped confidence intervals. At the lowest threshold, we observe near parity across both race and gender; however, at higher thresholds, we find some evidence of disparities in hiring rates across demographic groups, though the estimates are imprecise.
  • Figure 2: Differences in mean model scores across LLMs between synthetic applicants of different races and genders, reported in estimated population standard deviations, with 70% and 95% confidence intervals clustered by the real application dossier used to generate the synthetic application. Positive values indicate that the model rates female or racial minority applicants higher than male or White applicants on average.
  • Figure A1: Agreement between the model’s "perception" of a synthetic applicant's gender and the gender we intended to associate with the synthetic applicant.
  • Figure A2: Agreement between the model’s "perception" of a synthetic applicant's race and the race we intended to associate with the synthetic applicant.
  • Figure A3: Differences in mean model scores across variations in the wording of the prompt between synthetic applicants of different races and genders, reported in estimated population standard deviations, with 70% and 95% confidence intervals clustered by the real application dossier used to generate the synthetic application. Positive values indicate that the model rates female or racial minority applicants higher than male or White applicants on average. The blue vertical line represents the estimated effect in the original evaluation task, along with 70% and 95% confidence intervals.
  • ...and 3 more figures
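
The quantity plotted in Figure 1 can be illustrated with a short sketch. It assumes the usual definition of the adverse impact ratio (the selection rate of a focal group divided by that of a reference group) and a pivotal, or "basic", bootstrap interval. The helper names, the toy scores, and the choice to resample individual applicants within each group are assumptions made for illustration; the source does not specify the resampling unit.

```python
import random


def selection_rate(scores, threshold):
    """Share of applicants whose score meets the hiring threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def adverse_impact_ratio(focal, reference, threshold):
    """Focal-group selection rate divided by reference-group selection rate."""
    return selection_rate(focal, threshold) / selection_rate(reference, threshold)


def quantile(sorted_vals, q):
    """Quantile of an already sorted list, with linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def pivotal_ci(focal, reference, threshold, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and pivotal bootstrap interval for the ratio.

    Assumption: applicants are resampled with replacement within each group.
    """
    rng = random.Random(seed)
    point = adverse_impact_ratio(focal, reference, threshold)
    boots = []
    for _ in range(n_boot):
        f = rng.choices(focal, k=len(focal))
        r = rng.choices(reference, k=len(reference))
        if selection_rate(r, threshold) > 0:  # skip degenerate resamples
            boots.append(adverse_impact_ratio(f, r, threshold))
    boots.sort()
    q_lo = quantile(boots, alpha / 2)
    q_hi = quantile(boots, 1 - alpha / 2)
    # The pivotal interval reflects the bootstrap quantiles around the estimate.
    return point, 2 * point - q_hi, 2 * point - q_lo


# Toy scores on a 1-5 scale (made up for illustration, not from the paper).
focal_scores = [1, 2, 3, 4, 5] * 20
reference_scores = [2, 3, 4, 5, 5] * 20
point, ci_low, ci_high = pivotal_ci(focal_scores, reference_scores, threshold=4)
```

Repeating the calculation at several thresholds gives one ratio and interval per threshold, which is the shape of the comparison the Figure 1 caption describes.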