Targeted Learning for Data Fairness

Alexander Asemota, Giles Hooker

TL;DR

The paper treats fairness as a property of the data-generating process (data fairness) rather than solely of a model or algorithm, and proposes Targeted Learning (TL) as a flexible, nonparametric framework for statistical inference on fairness. It derives efficient influence-function-based estimators for traditional and probabilistic demographic parity and equal opportunity, as well as for conditional mutual information (CMI), with double robustness properties for the probabilistic metrics. Through simulations and real-data analyses (Adult-Income and Law School), the authors demonstrate TL's ability to produce valid inference under model misspecification, reveal data-level disparities, and quantify variable importance in fairness metrics. The work highlights the potential and challenges of data-fairness inference, discusses connections to causal-inference concepts, and points to future directions for extending metrics, inference methods, and remediation strategies in fairness-critical decisions.

Abstract

Data and algorithms have the potential to produce and perpetuate discrimination and disparate treatment. As such, significant effort has been invested in developing approaches to defining, detecting, and eliminating unfair outcomes in algorithms. In this paper, we focus on performing statistical inference for fairness. Prior work in fairness inference has largely focused on inferring the fairness properties of a given predictive algorithm. Here, we expand fairness inference by evaluating fairness in the data generating process itself, referred to here as data fairness. We perform inference on data fairness using targeted learning, a flexible framework for nonparametric inference. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. Additionally, we find that our estimators for probabilistic metrics exploit double robustness. To validate our approach, we perform several simulations and apply our estimators to real data.
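
As a rough illustration of what inference on a data-level fairness metric can look like, the sketch below estimates a demographic parity difference with an influence-function-based confidence interval. It is not the authors' implementation: it assumes demographic parity is measured as P(Y=1 | A=1) - P(Y=1 | A=0) for a binary outcome Y and binary group indicator A, it uses no covariates or nuisance models, and the function name `demographic_parity_ci` and the simulated data are hypothetical.

```python
import numpy as np


def demographic_parity_ci(y, a, z=1.96):
    """Estimate P(Y=1|A=1) - P(Y=1|A=0) with a Wald interval.

    Hypothetical helper for illustration only. The standard error comes
    from the estimated influence function of a difference in group means.
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=int)
    n = y.size

    p1 = a.mean()                # estimate of P(A=1)
    mu1 = y[a == 1].mean()       # estimate of P(Y=1 | A=1)
    mu0 = y[a == 0].mean()       # estimate of P(Y=1 | A=0)
    psi = mu1 - mu0              # plug-in estimate of the disparity

    # Estimated influence function value for each observation.
    phi = (a == 1) / p1 * (y - mu1) - (a == 0) / (1 - p1) * (y - mu0)

    # The influence function has mean zero, so its second moment is its variance.
    se = np.sqrt(np.mean(phi ** 2) / n)
    return psi, (psi - z * se, psi + z * se)


# Simulated data (an assumption for this sketch): true disparity is 0.6 - 0.5 = 0.1.
rng = np.random.default_rng(0)
a = rng.binomial(1, 0.4, size=5000)
y = rng.binomial(1, np.where(a == 1, 0.6, 0.5))

est, (lo, hi) = demographic_parity_ci(y, a)
print(f"estimate={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

In this covariate-free case the estimate reduces to a difference in group means. The probabilistic metrics described in the abstract additionally involve estimated nuisance functions, which is where double robustness becomes relevant.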

Paper Structure

This paper contains 23 sections, 34 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Plots for estimating demographic parity in simulation Setting 1. Each point represents an estimate from a simulated dataset. In both plots, the X-axis is sample size and the Y-axis is the estimated demographic parity. The dotted line represents the true value of demographic parity for the distribution. For the lower plot, the bands around each point represent the 95% confidence interval.
  • Figure 2: Heatmap demonstrating coverage in different scenarios and as sample size varies. For each cell in the heatmap, coverage is calculated over 100 simulations.
  • Figure 3: Line plot comparing targeted learning estimate of variance to a t-test estimate of variance.
  • Figure 4: Line plot and heatmap demonstrating error and coverage results for CMI. In the line plots, the X-axis is the value of c and the Y-axis is the error. The solid horizontal line represents an error of 0. For every combination of (c, estimator type, sample size), we perform 100 simulations. The heatmap shows coverage for the TL estimator as c and sample size vary.
  • Figure 5: Feature importance to fairness scores for both the Adult and Law school datasets. The first row contains feature importances for Adult, and the second row contains feature importances for Law school.