Causal Synthetic Data Generation in Recruitment

Andrea Iommi, Antonio Mastropietro, Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri

TL;DR

This work tackles the privacy-constrained problem of training fair recruitment models by introducing a causal generative data framework with two domain-specific CGMs for job offers and curricula. By eliciting expert-driven causal graphs and embedding a tunable bias parameter $\boldsymbol{\alpha}$, the approach enables controlled generation of synthetic tabular data while preserving interpretability and enabling counterfactual fairness analyses in ranking. The paper defines a downstream ranking setup with a linear ground-truth score and evaluates fairness via Demographic Parity (DP) and normalised Discounted Difference (rND), demonstrating how distributional shifts in gender-conditioned working hours influence ranking fairness. Its contributions include (i) domain-specific causal graphs for HR data, (ii) a causality-grounded SDG method for mixed data types, and (iii) an open-source Python toolkit for extending SDG with external knowledge like ESCO. The practical impact lies in providing a transparent, auditable framework to study and mitigate bias in candidate rankings under privacy-preserving synthetic data regimes, with potential applicability to policy and governance in high-stakes hiring contexts.

Abstract

The importance of Synthetic Data Generation (SDG) has increased significantly in domains where data quality is poor or access is limited due to privacy and regulatory constraints. One such domain is recruitment, where publicly available datasets are scarce due to the sensitive nature of information typically found in curricula vitae, such as gender, disability status, or age. This lack of accessible, representative data presents a significant obstacle to the development of fair and transparent machine learning models, particularly ranking algorithms that require large volumes of data to effectively learn how to recommend candidates. In the absence of such data, these models are prone to poor generalisation and may fail to perform reliably in real-world scenarios. Recent advances in Causal Generative Models (CGMs) offer a promising solution. CGMs enable the generation of synthetic datasets that preserve the underlying causal relationships within the data, providing greater control over fairness and interpretability in the data generation process. In this study, we present a specialised SDG method involving two CGMs: one modelling job offers and the other modelling curricula. Each model is structured according to a causal graph informed by domain expertise. We use these models to generate synthetic datasets and evaluate the fairness of candidate rankings under controlled scenarios that introduce specific biases.
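
To make the idea of a causal generative model with a bias control concrete, here is a minimal sketch of ancestral sampling over a toy two-node graph (gender → working hours). The variable names, the categories, and the reading of $\alpha$ as a per-group probability of full-time work are illustrative assumptions. They are not the paper's causal graphs or its parameterisation.

```python
import random

# Hypothetical toy DAG: gender -> working_hours.
# Categories and probabilities are illustrative assumptions only.

def sample_candidate(alpha, rng):
    """Ancestral sampling: draw each node after its parents."""
    gender = rng.choice([0, 1])  # root node: 0 = not-male, 1 = male
    # Assumption: alpha[gender] is the probability of full-time work
    # for that group, so unequal alphas inject a gender-conditioned shift.
    p_full = alpha[gender]
    hours = "full-time" if rng.random() < p_full else "part-time"
    return {"gender": gender, "working_hours": hours}

def generate(n, alpha, seed=0):
    """Generate n synthetic candidates with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [sample_candidate(alpha, rng) for _ in range(n)]

def full_time_rate(data, gender):
    """Share of full-time candidates within one gender group."""
    group = [d for d in data if d["gender"] == gender]
    return sum(d["working_hours"] == "full-time" for d in group) / len(group)

if __name__ == "__main__":
    data = generate(2000, alpha=(0.3, 0.7))
    print(full_time_rate(data, 0), full_time_rate(data, 1))
```

Setting both entries of `alpha` to the same value removes the gender dependence in this toy model, while moving them apart produces the kind of controlled distributional shift the abstract describes.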

Paper Structure

This paper contains 23 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Job posting generation process as reconstructed from the interviews with HR professionals. The meaning of an arrow $A \rightarrow B$ is "$A$ determines $B$".
  • Figure 2: Causal graphs adopted for the experiments: left for job offers, right for curricula. The SDG developed is fully general and accepts DAG causal graphs as input. The dotted line in the curriculum's causal graph controls for bias through the parameter $\alpha$. In our experiment, all features are categorical/ordinal, except for "skills" which is a set of categorical values.
  • Figure 3: Demographic Parity difference (DP) and normalised Discounted Difference (rND) as the bias-controlling parameters $\alpha_0$ (for not-male) and $\alpha_1$ (for male) vary. DP and rND are averaged over 10 runs, each with 300 job offers. Each series corresponds to the paper's ranking-model equation with a different weight on the fitness value of working hours. Shaded bands represent $\pm 1$ standard deviation over the $10$ runs.
  • Figure 4: Preprocessing pipeline for skills.
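
The two fairness measures named in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's implementation. DP is taken here as the absolute gap in top-k selection rates between two groups. rND follows the commonly cited form that sums, at cut-offs 10, 20, and so on, the gap between the protected share in the prefix and in the whole ranking, discounted by $\log_2$ of the cut-off. The normaliser is approximated by the larger of the two fully segregated rankings, which is an assumption. The function names and the 0/1 group encoding are hypothetical.

```python
import math

def dp_difference(ranking, k):
    """Absolute gap in top-k selection rates between groups 0 and 1.

    `ranking` is a list of group labels (0 or 1), ordered best-first.
    """
    top = ranking[:k]
    rates = []
    for g in (0, 1):
        total = ranking.count(g)
        rates.append(top.count(g) / total if total else 0.0)
    return abs(rates[0] - rates[1])

def _raw_rnd(ranking, protected, step):
    """Unnormalised discounted difference; step must be >= 2 (log2(1) = 0)."""
    n = len(ranking)
    overall = ranking.count(protected) / n
    total = 0.0
    for i in range(step, n + 1, step):
        prefix_share = ranking[:i].count(protected) / i
        total += abs(prefix_share - overall) / math.log2(i)
    return total

def rnd(ranking, protected=0, step=10):
    """rND scaled by the worse of the two fully segregated rankings (assumption)."""
    n_prot = ranking.count(protected)
    n_other = len(ranking) - n_prot
    other = 1 - protected
    extremes = (
        [protected] * n_prot + [other] * n_other,
        [other] * n_other + [protected] * n_prot,
    )
    z = max(_raw_rnd(r, protected, step) for r in extremes)
    return _raw_rnd(ranking, protected, step) / z if z else 0.0

if __name__ == "__main__":
    segregated = [1] * 10 + [0] * 10   # all of group 1 ranked first
    mixed = [0, 1] * 10                # groups alternate
    print(dp_difference(segregated, 10), rnd(segregated))
    print(dp_difference(mixed, 10), rnd(mixed))
```

In this sketch a fully segregated ranking scores at the maximum on both measures and an alternating ranking scores zero, which matches the intuition that larger values indicate a stronger group imbalance near the top of the ranking.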