BAD: BiAs Detection for Large Language Models in the context of candidate screening

Nam Ho Koh, Joseph Plata, Joyce Chai

TL;DR

This work investigates how large language models may perpetuate or mitigate biases in candidate screening by generating a demographically-informed resume dataset and evaluating biases with a context-association test (CAT). It combines resume-generation experiments with statistical bias analysis (chi-squared tests) and CAT metrics to quantify stereotyping tendencies in outputs from GPT-4 and GPT-3.5-turbo, revealing model-dependent variation in bias and highlighting practical implications for hiring systems. The authors open-source their CAT framework and dataset to promote transparency and further study, calling for careful deployment and broader evaluation across models and domains. Overall, the study provides a multi-angle assessment of LLM-induced bias in screening contexts and a baseline for ongoing fairness audits in HR technology.

Abstract

Applicant Tracking Systems (ATS) have allowed talent managers, recruiters, and college admissions committees to process large volumes of candidate applications efficiently. Traditionally, this screening process was conducted manually, creating major bottlenecks due to the quantity of applications and introducing many instances of human bias. The advent of large language models (LLMs) such as ChatGPT, and the potential for adopting them in current automated application screening, raises additional bias and fairness issues that must be addressed. In this project, we aim to identify and quantify instances of social bias in ChatGPT and other OpenAI LLMs in the context of candidate screening, in order to demonstrate how the use of these models could perpetuate existing biases and inequalities in the hiring process.

Paper Structure (19 sections, 2 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Breakdown of estimated ethnicity and job area
  • Figure 2: Breakdown of estimated gender and job area
  • Figure 3: Relative Representation for Software Engineering
  • Figure 4: Relative Representation for Marketing
  • Figure 5: Distribution of Estimated Ethnicity
  • ...and 1 more figures