Measuring Validity in LLM-based Resume Screening

Jane Castleman, Zeyu Shen, Blossom Metevier, Max Springer, Aleksandra Korolova

TL;DR

This proposed framework provides a principled approach to audit LLM resume screeners in the absence of ground truth, offering a crucial tool to independent auditors and developers to ensure the validity of these systems as they are deployed.

Abstract

Resume screening is perceived as a particularly suitable task for LLMs given their ability to analyze natural language; thus many entities rely on general-purpose LLMs without further adapting them to the task. While researchers have shown that some LLMs are biased in their selection rates of different demographics, studies measuring the validity of LLM decisions are limited. One of the difficulties in externally measuring validity stems from the lack of access to a large corpus of resumes for which the ground-truth ranking is known and that has not already been used for LLM training. In this work, we overcome this challenge by systematically constructing a large dataset of resumes tailored to particular jobs that are directly comparable, with a known ground truth of superiority. We then use the constructed dataset to measure the validity of ranking decisions made by various LLMs, finding that many models are unable to consistently select the resumes describing more qualified candidates. Furthermore, when measuring the validity of decisions, we find that models do not reliably abstain when ranking equally qualified candidates, and select candidates from different demographic groups at different rates, occasionally prioritizing historically marginalized candidates. Our proposed framework provides a principled approach to audit LLM resume screeners in the absence of ground truth, offering a crucial tool to independent auditors and developers to ensure the validity of these systems as they are deployed.
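As a rough illustration of the kind of pairwise audit the abstract describes, the sketch below scores a screener on two quantities: how often it picks the resume known to be better, and how often it abstains when the two resumes are equally qualified. The `PairOutcome` record and both function names are hypothetical and introduced only for this example; they are not the paper's metric definitions, which may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PairOutcome:
    """One pairwise comparison shown to a screener (hypothetical record)."""
    truth: Optional[str]   # "a" or "b" if one resume is known to be better; None if equally qualified
    choice: Optional[str]  # "a" or "b" if the model picked one; None if it abstained


def correct_selection_rate(outcomes: Sequence[PairOutcome]) -> float:
    """Fraction of unequal pairs where the model picked the better resume.

    An abstention on an unequal pair counts as incorrect (an assumption).
    """
    unequal = [o for o in outcomes if o.truth is not None]
    if not unequal:
        raise ValueError("no pairs with a known better resume")
    return sum(o.choice == o.truth for o in unequal) / len(unequal)


def abstention_rate(outcomes: Sequence[PairOutcome]) -> float:
    """Fraction of equally qualified pairs where the model abstained."""
    equal = [o for o in outcomes if o.truth is None]
    if not equal:
        raise ValueError("no equally qualified pairs")
    return sum(o.choice is None for o in equal) / len(equal)


# Toy outcomes, made up for illustration only.
outcomes = [
    PairOutcome(truth="a", choice="a"),    # picked the better resume
    PairOutcome(truth="b", choice="a"),    # picked the worse resume
    PairOutcome(truth="a", choice=None),   # abstained although one was better
    PairOutcome(truth="b", choice="b"),    # picked the better resume
    PairOutcome(truth=None, choice=None),  # correctly abstained
    PairOutcome(truth=None, choice="a"),   # picked one of two equal resumes
]

print(correct_selection_rate(outcomes))
print(abstention_rate(outcomes))
```

Keeping unequal and equal pairs in separate metrics mirrors the abstract's two findings: failing to select the more qualified candidate and failing to abstain are different kinds of error.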

Paper Structure (49 sections, 6 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Based on a job description, we create a base resume $c$ that meets all required qualifications. Then, we use LLMs to generate more-qualified candidates $c^+ \succ_j c$, less-qualified candidates $c \succ_j c^-$, and equally-qualified candidates with varying demographic information.
  • Figure 2: $\texttt{DiscrimValidity}$ by model and candidate demographic information type, measuring model abstention rates in deciding between equally-qualified candidates. Model error occurs when a model selects one of the two candidates rather than abstaining.
  • Figure 3: We plot models' SelectionRate for Software Engineer (SW), Business Development Representative, German Speaking (BD), Nurse Practitioner (NP), and Wind Turbine Technician (WT). The expected SelectionRate given pairwise comparisons is 0.5.
  • Figure 4: CriterionValidity by model and occupation for $k=1$, where SW = Software Engineer, NP = Nurse Practitioner, and WT = Wind Turbine Technician.
  • Figure 5: CriterionValidity by model and occupation for $k=3$, where SW = Software Engineer, NP = Nurse Practitioner, and WT = Wind Turbine Technician.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 1: Axioms for Valid Resume Screening