Data vs. Model Machine Learning Fairness Testing: An Empirical Study

Arumoy Shome, Luis Cruz, Arie van Deursen

TL;DR

This paper proposes a data-centric fairness testing approach that evaluates bias both before ML model training, using data fairness metrics (DFM), and after it, using model fairness metrics (MFM), addressing a gap in existing fairness evaluation, which happens only after training. Through an empirical study using 2 fairness metrics, 4 ML algorithms, 5 real-world datasets, and 1600 fairness evaluation cycles, the authors show a positive DFM–MFM correlation when the distribution or size of the training data changes, indicating that data bias and model bias vary together under distribution shifts. They also report a trade-off between training data size, fairness detection, and computational cost: smaller training samples expose fairness issues more readily but may hurt predictive performance, while larger datasets mitigate bias at a higher resource cost. The findings support using DFM as an early warning for data drift and upstream bias, which can reduce development time by avoiding unnecessary full training cycles. The approach has practical implications for data collection, monitoring, and fair model deployment across the ML lifecycle.

Abstract

Although several fairness definitions and bias mitigation techniques exist in the literature, all existing solutions evaluate the fairness of Machine Learning (ML) systems after the training stage. In this paper, we take the first steps towards evaluating a more holistic approach by testing for fairness both before and after model training. We evaluate the effectiveness of the proposed approach and position it within the ML development lifecycle, using an empirical analysis of the relationship between model-dependent and model-independent fairness metrics. The study uses 2 fairness metrics, 4 ML algorithms, 5 real-world datasets and 1600 fairness evaluation cycles. We find a linear relationship between data and model fairness metrics when the distribution and the size of the training data change. Our results indicate that testing for fairness prior to training can be a "cheap" and effective means of catching a biased data collection process early, detecting data drift in production systems, and minimising the execution of full training cycles, thus reducing development time and costs.
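
The distinction between a data fairness metric and a model fairness metric can be illustrated with a minimal sketch. It assumes statistical parity difference as the metric (the abstract does not name the two metrics used), and the toy data, the hard-coded predictions standing in for a trained model, and the function name are all hypothetical. The same formula is applied twice: once to the ground-truth labels, which needs no model, and once to the predictions.

```python
from typing import Sequence


def statistical_parity_difference(outcomes: Sequence[int],
                                  protected: Sequence[int]) -> float:
    """P(outcome = 1 | unprivileged) - P(outcome = 1 | privileged).

    `protected` uses 1 for the privileged group and 0 for the
    unprivileged group (an assumed encoding for this sketch).
    A value of 0 means both groups receive the favourable outcome
    at the same rate.
    """
    unpriv = [o for o, p in zip(outcomes, protected) if p == 0]
    priv = [o for o, p in zip(outcomes, protected) if p == 1]
    if not unpriv or not priv:
        raise ValueError("both groups must be non-empty")
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)


# Hypothetical toy data: 4 privileged and 4 unprivileged individuals.
protected = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]       # ground truth in the training data
predictions = [1, 1, 1, 1, 0, 0, 0, 0]  # stand-in for a trained model's output

# Data fairness metric: computed on labels, before any training.
dfm = statistical_parity_difference(labels, protected)

# Model fairness metric: computed on predictions, after training.
mfm = statistical_parity_difference(predictions, protected)

print(f"DFM = {dfm:+.2f}, MFM = {mfm:+.2f}")
```

In this toy case both values are negative, so the data and the stand-in model disadvantage the same group. The DFM is available without running a training cycle, which is the sense in which pre-training testing is described as cheap.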

Paper Structure (22 sections, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Stages of the ML Lifecycle (adapted from amershi2019software, breck2019data). Three distinct phases of the lifecycle are marked by different colours. Stages in the experimental and production phases may loop back to any prior stage, indicated by the large grey arrows. The locations of fairness testing using DFM and MFM are marked by the green labels. The green arrow depicts the shift in ML fairness testing proposed by this study.
  • Figure 2: Methodology for evaluating fairness of datasets and ML models using DFM and MFM.
  • Figure 3: Boxplot showing distribution of DFM and MFM for all datasets, models and fairness metrics.
  • Figure 4: Visual explanation of the rationale for using a smaller training sample to simulate change in the distribution of the training data. The grey boxes represent the full dataset while the blue boxes represent the training set for three hypothetical iterations. More overlap in the blue boxes depicts less distribution change and vice versa.
  • Figure 5: (left) Lineplot showing the relationship between performance metrics and training sample size in the german-age dataset. Data from the 50 iterations is aggregated using the mean; the error bars show the standard deviation. (right) Countplot showing the number of cases with a significant change in accuracy and F1 when trained using the full vs. smaller training sample size.
  • ...and 7 more figures