The influence of missing data mechanisms and simple missing data handling techniques on fairness

Aeysha Bhatti, Trudie Sandrock, Johane Nienkemper-Swanepoel

TL;DR

This study investigates how missing data mechanisms (MCAR, MAR, MNAR) and simple handling techniques (listwise deletion (LD), mode imputation, regression imputation, and kNN imputation) influence fairness and accuracy in ML classifiers across three real-world datasets (German credit, Adult, COMPAS). Using artificial amputation to control the missingness mechanism and a four-model ensemble, the authors find that MAR can substantially affect fairness distributions and that simple methods like LD and mode imputation often yield higher fairness than more complex imputation methods, though sometimes at the cost of accuracy. The results emphasize the importance of considering the missing data pathway when evaluating fairness, and they reveal dataset-specific patterns in how missingness and imputation interact with different fairness metrics (dp, eo, pe). The work offers practical guidance for fairness-sensitive applications, suggesting mode imputation as a low-cost option to improve fairness in some MAR contexts and calling for future work on more advanced techniques such as multiple imputation to assess potential fairness gains.

Abstract

Fairness of machine learning algorithms is receiving increasing attention, as such algorithms permeate the day-to-day aspects of our lives. One way in which bias can manifest in a dataset is through missing values. If data are missing, they are often assumed to be missing completely at random; in reality the propensity of data being missing is often tied to the demographic characteristics of individuals. There is limited research into how missing values and the handling thereof can impact the fairness of an algorithm. Most researchers either apply listwise deletion or use the simpler methods of imputation (e.g. mean or mode) rather than the more advanced ones (e.g. multiple imputation); we therefore study the impact of the simpler methods on the fairness of algorithms. The starting point of the study is the mechanism of missingness, leading into how the missing data are processed and finally how this impacts fairness. Three popular datasets in the field of fairness are amputed, i.e. have missing values artificially introduced, in a simulation study. The results show that under certain scenarios the impact on fairness can be pronounced when the missingness mechanism is missing at random. Furthermore, elementary missing data handling techniques like listwise deletion and mode imputation can lead to higher fairness compared to more complex imputation methods like k-nearest neighbour imputation, albeit often at the cost of lower accuracy.
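To make the pipeline described in the abstract concrete (amputation under a chosen mechanism, followed by a simple handling technique), the following is a minimal sketch on toy data. It is not the authors' simulation code: the synthetic variables `sex` and `feature`, the missingness rates, and all function names are assumptions introduced purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: a binary sensitive attribute and a
# three-level categorical feature stored as floats so it can hold NaN.
sex = rng.integers(0, 2, size=n)
feature = rng.integers(0, 3, size=n).astype(float)


def ampute_mcar(x, rate, rng):
    """MCAR: every value goes missing with the same probability."""
    out = x.copy()
    out[rng.random(x.shape[0]) < rate] = np.nan
    return out


def ampute_mar(x, group, rate_group0, rate_group1, rng):
    """MAR: the missingness probability depends on an observed variable
    (here the sensitive attribute), not on the value being removed."""
    out = x.copy()
    p = np.where(group == 1, rate_group1, rate_group0)
    out[rng.random(x.shape[0]) < p] = np.nan
    return out


def listwise_deletion_mask(x):
    """Boolean mask of the rows that listwise deletion would keep."""
    return ~np.isnan(x)


def mode_impute(x):
    """Replace missing entries with the most frequent observed value."""
    observed = x[~np.isnan(x)]
    values, counts = np.unique(observed, return_counts=True)
    out = x.copy()
    out[np.isnan(out)] = values[np.argmax(counts)]
    return out


# Assumed rates: 20% overall for MCAR; 10% vs 40% by group for MAR.
x_mcar = ampute_mcar(feature, 0.2, rng)
x_mar = ampute_mar(feature, sex, 0.1, 0.4, rng)

keep = listwise_deletion_mask(x_mar)
x_imputed = mode_impute(x_mar)

# Under MAR, listwise deletion shifts the group composition of the
# retained rows, whereas mode imputation keeps every row.
share_before = sex.mean()
share_after_ld = sex[keep].mean()
print(f"share of group 1 before: {share_before:.3f}, after LD: {share_after_ld:.3f}")
```

In this sketch the MAR mechanism removes values more often for one group, so listwise deletion changes the demographic make-up of the remaining sample, while mode imputation preserves the sample but pulls the imputed feature towards a single category. Either effect could plausibly feed into downstream fairness metrics, which is the kind of interaction the study examines.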

Paper Structure

This paper contains 25 sections, 10 equations, 5 figures.

Figures (5)

  • Figure 1: Adult dataset, sensitive variable sex, model rf, mode imputation
  • Figure 2: Adult dataset, sensitive variable sex, model rf, reg imputation
  • Figure 3: German dataset, pe distributions
  • Figure 4: Adult dataset, eo distributions
  • Figure 5: Adult dataset, pe distributions