Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education

Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka

TL;DR

This work assesses fairness in LLM-driven job-resume matching in a U.S., English-language context by systematically manipulating candidate demographics (gender, race) and educational background across 40 occupations, 12 models, and a large synthetic variant space. Using a 1–10 matching score and ROC AUC evaluation, it shows that explicit biases related to gender and race have diminished in recent models, while implicit biases related to educational prestige persist. The study highlights the need for continuous fairness auditing and advanced mitigation strategies to prevent biased hiring outcomes in real-world HR deployments. The findings have practical implications for industry practitioners seeking equitable AI-powered recruitment and for researchers developing robust bias mitigation techniques.

Abstract

Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes, streamlining recruitment processes, and reducing operational costs. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context. It evaluates how factors such as gender, race, and educational background influence model decisions, providing critical insights into the fairness and reliability of LLMs in HR applications. Our findings indicate that while recent models have reduced biases related to explicit attributes like gender and race, implicit biases concerning educational background remain significant. These results highlight the need for ongoing evaluation and the development of advanced bias mitigation strategies to ensure equitable hiring practices when using LLMs in industry settings.

Paper Structure

This paper contains 21 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Pipeline for evaluating bias in LLM-based job-resume matching systems. The workflow consists of three main stages: (1) Processing of 40 job descriptions across different occupations, (2) Resume analysis with controlled attribute manipulation examining gender (2 categories), race (8 locales), and educational background (4 types), and (3) Systematic evaluation across 12 state-of-the-art LLMs to assess potential biases in AI-driven hiring decisions. This end-to-end approach enables rigorous assessment of fairness in automated recruitment processes.
  • Figure 2: (a) ROC AUC scores showing matching accuracy, where 1.0 indicates perfect classification, (b) Gender bias, reported as the percentage of the 40 occupations in which the model shows statistically significant gender bias, (c) Racial bias percentage across job categories, and (d) Educational bias percentage in hiring decisions. The dashed lines represent ideal targets: perfect matching (1.0 ROC AUC) and complete absence of bias (0%). The analysis tracks the evolution of 12 different LLM versions, demonstrating both progress and persistent challenges in achieving fair AI-driven hiring practices.
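
The two kinds of quantity plotted in Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the example scores, the occupation names, and the 0.05 significance threshold are all assumptions made for illustration, and the statistical test that produces the per-occupation p-values is not specified in this summary, so the p-values are simply taken as input. ROC AUC is computed here as the probability that a truly matching resume receives a higher 1–10 score than a non-matching one, with ties counted as one half.

```python
from itertools import product


def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores.

    scores: model-assigned matching scores (e.g. integers from 1 to 10).
    labels: 1 if the resume truly matches the job, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def biased_occupation_pct(p_values, alpha=0.05):
    """Percentage of occupations whose bias test is significant.

    p_values: hypothetical mapping of occupation name -> p-value from
    whatever significance test is applied (not specified here).
    alpha: assumed significance threshold.
    """
    flagged = sum(1 for p in p_values.values() if p < alpha)
    return 100.0 * flagged / len(p_values)


# Illustrative inputs only; these are not results from the paper.
example_scores = [9, 7, 4, 2, 7]
example_labels = [1, 1, 0, 0, 0]
example_p_values = {"nurse": 0.01, "welder": 0.20, "teacher": 0.04, "chef": 0.50}

auc = roc_auc(example_scores, example_labels)
pct = biased_occupation_pct(example_p_values)
```

Because the matching scores are coarse integers, ties between a matching and a non-matching resume are common, which is why the sketch handles them explicitly rather than ignoring them.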