FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations

Athena Wen, Tanush Patil, Ansh Saxena, Yicheng Fu, Sean O'Brien, Kevin Zhu

TL;DR

FAIRE tackles fairness in AI-driven resume screening by benchmarking racial and gender bias in large language models. It introduces direct scoring and ranking as complementary evaluation methods, perturbing resumes with demographic cues across ten professions and testing multiple models. The study reveals pervasive bias with model-dependent patterns: GPT-4o often favors Asian resumes, GPT-4o-mini shows pronounced sensitivity, Claude Haiku is comparatively balanced, while Sonnet and Llama 3.3 70B exhibit stronger biases in several dimensions. The open-source benchmark and data provide a practical tool for ongoing fairness assessment and bias-reduction efforts in AI hiring, with important ethical implications for deployment.

Abstract

In an era where AI-driven hiring is transforming recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume Evaluation), to test for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods, direct scoring and ranking, to measure how a model's evaluations change when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. This benchmark provides a clear way to examine these differences and offers valuable insights into the fairness of AI-based hiring tools. It highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at our repository: https://github.com/athenawen/FAIRE-Fairness-Assessment-In-Resume-Evaluation.git.

Paper Structure

This paper contains 16 sections, 2 figures, and 14 tables.

Figures (2)

  • Figure 1: Overview of Direct Scoring and Ranking Evaluation. In the Direct Scoring setup, the LLM assigns a score to each resume for every evaluation dimension based on the job description. In the Ranking setup, the LLM ranks a batch of 5 resumes according to their overall strength.
  • Figure 2: Ranking evaluation results by different LLMs. A higher ranking score indicates that the LLM perceives the resume as weaker.
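
To make the ranking setup in the captions above concrete, here is a minimal Python sketch of how per-group ranking scores could be aggregated. It assumes that a group's ranking score is simply the mean position its resumes receive across batches, which is consistent with "higher score means weaker perceived strength" but is not confirmed as the paper's exact definition. The function name `mean_rank_by_group`, the group labels, and the toy rankings are all hypothetical and exist only for illustration; they are not results from the benchmark.

```python
from collections import defaultdict
from statistics import mean


def mean_rank_by_group(rankings):
    """Average rank position per demographic group.

    `rankings` is a list of batches. Each batch lists group labels in the
    order the model ranked the corresponding resumes, strongest first, so
    position 1 is the best rank. (Assumed input format, not the paper's.)
    """
    positions_by_group = defaultdict(list)
    for batch in rankings:
        for position, group in enumerate(batch, start=1):
            positions_by_group[group].append(position)
    return {group: mean(pos) for group, pos in positions_by_group.items()}


# Hypothetical toy data: two batches of 5 resumes, labelled by placeholder groups.
toy_rankings = [
    ["A", "B", "C", "D", "E"],
    ["B", "A", "C", "E", "D"],
]

scores = mean_rank_by_group(toy_rankings)
for group in sorted(scores):
    print(group, scores[group])
```

Under this assumption, a group whose mean rank is consistently higher than the others' is being placed lower in the batch more often, even though the resumes differ only in their demographic cues.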