Small Changes, Big Impact: Demographic Bias in LLM-Based Hiring Through Subtle Sociocultural Markers in Anonymised Resumes

Bryan Chen Zhengyu Tan, Shaun Khoo, Bich Ngoc Doan, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee

TL;DR

This work introduces a generalisable stress-test framework for hiring fairness, instantiated in the Singapore context, and suggests that seemingly innocuous markers surviving anonymisation can materially skew automated hiring outcomes.

Abstract

Large Language Models (LLMs) are increasingly deployed in resume screening pipelines. Although explicit PII (e.g., names) is commonly redacted, resumes typically retain subtle sociocultural markers (languages, co-curricular activities, volunteering, hobbies) that can act as demographic proxies. We introduce a generalisable stress-test framework for hiring fairness, instantiated in the Singapore context: 100 neutral job-aligned resumes are augmented into 4100 variants spanning four ethnicities and two genders, differing only in job-irrelevant markers. We evaluate 18 LLMs in two realistic settings: (i) Direct Comparison (1v1) and (ii) Score & Shortlist (top-scoring rate), each with and without rationale prompting. Even without explicit identifiers, models recover demographic attributes with high F1 and exhibit systematic disparities, with models favouring markers associated with Chinese and Caucasian males. Ablations show language markers suffice for ethnicity inference, whereas gender relies on hobbies and activities. Furthermore, prompting for explanations tends to amplify bias. Our findings suggest that seemingly innocuous markers surviving anonymisation can materially skew automated hiring outcomes.

Paper Structure (57 sections, 29 figures, 10 tables)


Figures (29)

  • Figure 1: Overview of the experimental framework. (A) We generate 100 neutral resumes and inject them with sociocultural markers to create 4100 demographic variants. (B) We test 18 LLMs using two evaluation settings: Direct Comparison (1v1) and Score & Shortlist. (C) Results reveal systematic disparities where ethnicity drives the macro-ranking (top vs. bottom tiers) and gender drives the micro-ranking (male advantage within tiers).
  • Figure 2: Ablation study showing class-level F1 for demography recovery by Gemini 3 Flash as sociocultural markers are progressively removed. Removing languages greatly reduces the recoverability of ethnicity, while removing hobbies/activities reduces the recoverability of gender.
  • Figure 3: Aggregate bias metrics by condition (mean $\pm$ SD across 18 models). Deviation-from-ideal is larger under Score & Shortlist than under Direct Comparison. Rationale prompting tends to increase bias.
  • Figure 4: Per-model effect of rationale prompting on bias metrics. Positive values indicate an improvement with rationale prompting. Results show model-to-model heterogeneity, suggesting that rationale prompting is an unreliable and model-specific debiasing intervention.
  • Figure 5: Bias landscapes (averaged across rationale and no-rationale) for Direct Comparison (1v1, left) and Score & Shortlist (right). Each point is a model. The x-axis shows normalised demographic disparity ($\max_g-\min_g$), and the y-axis shows normalised deviation from the ideal ($0.5$ for Direct Comparison; $1.0$ for Score & Shortlist).
  • ...and 24 more figures
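
The two quantities plotted in Figure 5 can be illustrated with a minimal sketch. It assumes per-group rates are already available as a mapping from group label to rate. The function names, the group labels, and the numbers are hypothetical and are not taken from the paper. The disparity follows the $\max_g-\min_g$ form given in the caption, but the caption does not state how deviation from the ideal is aggregated across groups or how either quantity is normalised, so the mean absolute deviation used here is an assumption and no normalisation is applied.

```python
from typing import Dict


def demographic_disparity(rates: Dict[str, float]) -> float:
    """Gap between the highest and lowest group rate: max_g - min_g."""
    return max(rates.values()) - min(rates.values())


def deviation_from_ideal(rates: Dict[str, float], ideal: float) -> float:
    """Mean absolute distance of each group's rate from the ideal value.

    Assumption: the source does not specify the aggregation, so a simple
    mean over groups is used here for illustration.
    """
    return sum(abs(rate - ideal) for rate in rates.values()) / len(rates)


# Hypothetical Direct Comparison (1v1) win rates per demographic group.
# These are illustrative placeholders, not results reported in the paper.
win_rates = {
    "group_a": 0.5625,
    "group_b": 0.5,
    "group_c": 0.4375,
    "group_d": 0.5,
}

# In a 1v1 setting, an unbiased model would give every group a 0.5 win rate.
disparity = demographic_disparity(win_rates)
deviation = deviation_from_ideal(win_rates, ideal=0.5)

print(f"disparity (max_g - min_g): {disparity}")
print(f"mean deviation from ideal: {deviation}")
```

For Score & Shortlist, the same helper would be called with `ideal=1.0`, matching the ideal value stated in the Figure 5 caption.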