Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching

Kewen Peng, Yicheng Yang, Hao Zhuo

TL;DR

This work argues that fairness assessments in ML are fragile when test data do not reflect future distributions, and proposes propensity score matching (PSM) as a principled way to evaluate and mitigate bias. It introduces FairMatch, a post-processing approach that uses PSM to identify matched and unmatched test samples and to calibrate decision thresholds across protected groups, with probabilistic calibration for unmatched cases. Empirical results across six datasets show that fairness metrics are highly sensitive to sampling, while FairMatch can achieve competitive or superior fairness-performance trade-offs without sacrificing predictive accuracy. The proposed causality-driven fairness testing framework and the FairMatch algorithm offer a practical, data-driven path to more reliable fairness evaluation and mitigation in real-world ML deployments.

Abstract

Fairness-aware learning aims to mitigate discrimination against specific protected social groups (e.g., those categorized by gender, ethnicity, age) while minimizing predictive performance loss. Despite efforts to improve fairness in machine learning, prior studies have shown that many models remain unfair when measured against various fairness metrics. In this paper, we examine whether the way training and testing data are sampled affects the reliability of reported fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data, potentially skewing fairness assessments. To address this, we propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias. FairMatch identifies control and treatment pairs with similar propensity scores in the test set and adjusts decision thresholds for different subgroups accordingly. For samples that cannot be matched, we perform probabilistic calibration using fairness-aware loss functions. Experimental results demonstrate that our approach can (a) precisely locate subsets of the test data where the model is unbiased, and (b) significantly reduce bias on the remaining data. Overall, propensity score matching offers a principled way to improve both fairness evaluation and mitigation, without sacrificing predictive performance.
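The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' FairMatch implementation. It assumes the propensity scores have already been estimated (for example, by a classifier that predicts the protected attribute from the remaining features) and pairs treatment and control samples greedily, one to one, within a caliper. The function name `match_by_propensity`, the `caliper` parameter, and its default of 0.05 are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def match_by_propensity(
    scores: List[float],
    treated: List[bool],
    caliper: float = 0.05,  # assumed tolerance; not a value taken from the paper
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Greedy 1:1 nearest-neighbour matching on precomputed propensity scores.

    Returns (pairs, unmatched): pairs holds (treated_idx, control_idx) tuples,
    and unmatched lists every sample index left without a partner.
    """
    treated_idx = [i for i, t in enumerate(treated) if t]
    available = {i for i, t in enumerate(treated) if not t}

    pairs: List[Tuple[int, int]] = []
    # Visit treated samples in order of increasing propensity score.
    for i in sorted(treated_idx, key=lambda k: scores[k]):
        if not available:
            break
        # Closest control sample that has not been used yet (ties: lowest index).
        j = min(available, key=lambda k: (abs(scores[k] - scores[i]), k))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)

    matched = {k for pair in pairs for k in pair}
    unmatched = [k for k in range(len(scores)) if k not in matched]
    return pairs, unmatched
```

In this sketch the matched pairs would correspond to the subset on which group-specific decision thresholds are adjusted, and the unmatched indices to the samples the abstract routes to probabilistic calibration. Neither of those later steps is shown here.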

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 8 tables, 2 algorithms.

Figures (5)

  • Figure 1: Fairness testing proposed in this paper. To evaluate the effectiveness of any de-biasing model, we want to ensure that the test set reflects the ideal data distribution in the future.
  • Figure 2: Curves drawn from three datasets. The gray error area indicates the standard deviation as we use different random seeds for sub-sampling. The x-axis represents the percentage of unmatched testing samples used when calculating the fairness scores. The y-axis represents the fairness metric (DI in this case), where lower scores indicate better fairness.
  • Figure 3: Propensity score matching selects samples that represent the future distribution of class labels and the (expected) future distribution of protected attributes.
  • Figure 4: Examples derived from three datasets. The distribution of propensity scores is drawn among the matched/unmatched samples.
  • Figure 5: RQ3 result: The comparison of performance-fairness trade-offs before and after applying PSM. Each arrow represents a set of comparisons conducted on the datasets listed in Table \ref{tab:dataset}. For both axes, smaller values indicate better performance/fairness.
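The Figure 2 caption reports DI as a score where lower is better, but this section does not give the paper's exact formula. As a hedged illustration only, the sketch below uses one common convention: the absolute gap between 1 and the ratio of favorable-outcome rates for the unprivileged and privileged groups. The function name `disparate_impact_gap` and the encoding (1 = privileged, 1 = favorable prediction) are assumptions, not definitions taken from the paper.

```python
from typing import Sequence


def disparate_impact_gap(y_pred: Sequence[int], privileged: Sequence[int]) -> float:
    """Return |1 - P(y=1 | unprivileged) / P(y=1 | privileged)|.

    Assumed convention: 0 means both groups receive favorable predictions at
    the same rate, and larger values mean a larger disparity.
    """
    if len(y_pred) != len(privileged):
        raise ValueError("y_pred and privileged must have the same length")

    priv = [y for y, p in zip(y_pred, privileged) if p == 1]
    unpriv = [y for y, p in zip(y_pred, privileged) if p == 0]
    if not priv or not unpriv:
        raise ValueError("both groups must be present")

    priv_rate = sum(priv) / len(priv)
    unpriv_rate = sum(unpriv) / len(unpriv)
    if priv_rate == 0:
        raise ValueError("privileged group has no favorable predictions")

    return abs(1.0 - unpriv_rate / priv_rate)
```

Recomputing such a score on different random sub-samples of the test set, as the Figure 2 caption describes, is one way to observe how strongly a reported fairness value depends on which samples are included.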