PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset

Sumin Yu, Juhyeon Park, Taesup Moon

Abstract

We present PopResume, a population-representative resume dataset for causal fairness auditing of LLM- and VLM-based resume screening systems. Unlike existing benchmarks that rely on manually injected demographic information and outcome-level disparities, PopResume is grounded in population statistics and preserves natural attribute relationships, enabling path-specific effect (PSE)-based fairness evaluation. We decompose the effect of a protected attribute on resume scores into two paths: the business necessity path, mediated by job-relevant qualifications, and the redlining path, mediated by demographic proxies. This distinction allows auditors to separate legally permissible from impermissible sources of disparity. Evaluating four LLMs and four VLMs on PopResume's 60.8K resumes across five occupations, we identify five representative discrimination patterns that aggregate metrics fail to capture. Our results demonstrate that PSE-based evaluation reveals fairness issues masked by outcome-level measures, underscoring the need for causally-grounded auditing frameworks in AI-assisted hiring.

Paper Structure (45 sections, 3 theorems, 19 equations, 8 figures, 13 tables)

Key Result

Lemma 1

Given a sample $\mathcal{D} \overset{i.i.d.}{\sim} P(\mathbf{V})$, the doubly robust estimator $\hat{\psi}$ for $\mathbb{E}[Y_{x_1,\mathbf{B}_{x_1,\mathbf{R}_{x_0}}, \mathbf{R}_{x_0}}]$ constructed by the following procedure has a finite-sample guarantee:

Figures (8)

  • Figure 1: Prior works inject protected attributes, making causal framework–based evaluation infeasible, and measure outcome disparity; our population-representative resume dataset instead enables causal effect–based evaluation and pathway decomposition.
  • Figure 2: In our framework, the mediator set $\mathbf{W}$ is decomposed into business necessity components $\mathbf{B}$ and redlining-related components $\mathbf{R}$. Blue edges denote BIE pathways, while red edges denote RIE pathways.
  • Figure 3: Pipeline for constructing the population-representative resume dataset and evaluating LLM/VLM-based resume screeners. (1) Estimation of the joint distribution $P(X, \mathbf{Z}, \mathbf{B}, \mathbf{R} \mid J)$ based on Assumption 1. (2) Population-representative structured profiles consisting of protected attribute $X$, confounder $\mathbf{Z}$, business necessity mediators $\mathbf{B}$, and redlining mediators $\mathbf{R}$. (3) Resume realization, where each structured profile is converted into a natural-language resume using rule-based procedures conditioned solely on assigned attributes, eliminating uncontrolled variation. Three formats are produced: text resumes for LLM evaluation, and resume images with and without synthesized profile photos for VLM evaluation. (4) Resume scoring by LLM/VLM screeners, which assign a score $Y$ given a job description and a resume. (5) Path-specific effect-based evaluation, estimating TE, NDE, and NIE, and further decomposing NIE into BIE and RIE.
  • Figure 4: Representative examples of five cases based on our causal decomposition.
  • Figure 5: PSEs in VLM-based resume scoring.
  • ...and 3 more figures
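
The captions for Figures 2 and 3 describe splitting the total effect (TE) into a direct effect (NDE) and an indirect effect (NIE), with the NIE further split into BIE and RIE. One way such a split can be written is as a telescoping sum through the nested counterfactual that appears in Lemma 1. The sketch below is an illustration only: the assignment of labels to each difference, the reference levels, and the ordering of terms are assumptions and may differ from the paper's own definitions. Here $\mathbf{W} = (\mathbf{B}, \mathbf{R})$, and $x_0$, $x_1$ are the two levels of the protected attribute $X$.

$$
\begin{aligned}
\mathrm{TE} &= \mathbb{E}[Y_{x_1}] - \mathbb{E}[Y_{x_0}] \\
&= \underbrace{\mathbb{E}[Y_{x_1,\mathbf{W}_{x_0}}] - \mathbb{E}[Y_{x_0}]}_{\text{NDE (assumed labeling)}} \\
&\quad + \underbrace{\mathbb{E}[Y_{x_1,\mathbf{B}_{x_1,\mathbf{R}_{x_0}},\mathbf{R}_{x_0}}] - \mathbb{E}[Y_{x_1,\mathbf{W}_{x_0}}]}_{\text{BIE (assumed labeling)}} \\
&\quad + \underbrace{\mathbb{E}[Y_{x_1}] - \mathbb{E}[Y_{x_1,\mathbf{B}_{x_1,\mathbf{R}_{x_0}},\mathbf{R}_{x_0}}]}_{\text{RIE (assumed labeling)}}
\end{aligned}
$$

The intermediate terms cancel, so the three differences sum to TE by construction. Under this labeling the last two terms together play the role of the NIE, and the middle counterfactual is the quantity that Lemma 1 targets.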

Theorems & Definitions (5)

  • Lemma 1: DML-UCA, NEURIPS2024_0c4bc137
  • Lemma 2: Parametrization
  • proof
  • Lemma 3: Double Robustness
  • proof
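
The double-robustness property named in Lemma 3 can be illustrated in a much simpler setting than the paper's path-specific estimand. The sketch below is not the paper's estimator: it uses the standard augmented inverse-probability-weighted (AIPW) estimator of $\mathbb{E}[Y_{x_1}]$ with a single binary confounder, on simulated data. The function names (`aipw_mean`, `simulate`) and the data-generating process are hypothetical and chosen only for illustration.

```python
import numpy as np

def aipw_mean(y, x, mu1_hat, pi_hat):
    """AIPW (doubly robust) estimate of E[Y_{x=1}].

    y       : observed outcomes
    x       : binary protected attribute / treatment indicator (0 or 1)
    mu1_hat : estimated outcome regression E[Y | X=1, Z] per sample
    pi_hat  : estimated propensity P(X=1 | Z) per sample
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Outcome-model prediction plus an inverse-probability-weighted residual.
    return float(np.mean(mu1_hat + x / pi_hat * (y - mu1_hat)))

def simulate(n=200_000, seed=0):
    """Hypothetical data: Z confounds X and Y; true E[Y_{x=1}] = 4.5."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, size=n)
    pi = np.where(z == 1, 0.8, 0.3)           # true propensity P(X=1 | Z)
    x = rng.binomial(1, pi)
    y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)
    return z, x, y, pi

z, x, y, pi_true = simulate()
mu1_true = 3.0 + 3.0 * z                      # true E[Y | X=1, Z]
mu1_wrong = np.zeros_like(mu1_true)           # deliberately misspecified
pi_wrong = np.full_like(pi_true, 0.5)         # deliberately misspecified

# Consistent when either nuisance model is correct ...
est_right_outcome = aipw_mean(y, x, mu1_true, pi_wrong)
est_right_propensity = aipw_mean(y, x, mu1_wrong, pi_true)
# ... but generally biased when both are wrong.
est_both_wrong = aipw_mean(y, x, mu1_wrong, pi_wrong)

print(est_right_outcome, est_right_propensity, est_both_wrong)
```

In this simulation the first two estimates should land close to the true value of 4.5, while the third should be visibly off. That is the sense in which the estimator is "doubly" robust: it tolerates misspecification of one nuisance model but not both.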