Optimal Transport for Fairness: Archival Data Repair using Small Research Data Sets

Abigail Langbridge, Anthony Quinn, Robert Shorten

TL;DR

This work tackles the problem of repairing unfairness in archival data by leveraging a small, S|U-labelled research dataset to design optimal transport-based repair plans. The method enforces conditional independence X ⟂ S | U by targeting the Wasserstein barycentre between S-conditional marginals on a common, interpolated support, enabling off-sample repairs of large archival datasets under stationarity. It introduces a distributional repair framework with feature-wise stratification and KDE-based marginal interpolation, and demonstrates substantial reductions in s|u dependence on both simulated and real data (Adult dataset), closely matching or exceeding state-of-the-art on-sample repairs while enabling scalable off-sample deployment. The approach offers a practical pathway for fairness certification and remediation in real-world data streams, requiring only a small fraction of labelled data to generalize repairs to unbounded archival collections.

Abstract

With the advent of the AI Act and other regulations, there is now an urgent need for algorithms that repair unfairness in training data. In this paper, we define fairness in terms of conditional independence between protected attributes ($S$) and features ($X$), given unprotected attributes ($U$). We address the important setting in which torrents of archival data need to be repaired, using only a small proportion of these data, which are $S|U$-labelled (the research data). We use the latter to design optimal transport (OT)-based repair plans on interpolated supports. This allows off-sample, labelled, archival data to be repaired, subject to stationarity assumptions. It also significantly reduces the size of the supports of the OT plans, with correspondingly large savings in the cost of their design and of their sequential application to the off-sample data. We provide detailed experimental results with simulated and benchmark real data (the Adult data set). Our performance figures demonstrate effective repair (in the sense of quenching conditional dependence) of large quantities of off-sample, labelled (archival) data.
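The design-then-apply idea summarised above can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes a single feature, a single $u$ stratum and equal barycentre weights, and it replaces the OT plan with the closed-form one-dimensional result that the Wasserstein barycentre is obtained by averaging quantile functions. The names `kde_pmf`, `design_repair`, `n_q` and `bandwidth` are hypothetical and chosen for illustration only.

```python
import math
import random
from bisect import bisect_left


def kde_pmf(samples, grid, bandwidth):
    """Gaussian KDE of `samples` evaluated on `grid`, normalised to sum to 1."""
    dens = [
        sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in samples)
        for g in grid
    ]
    total = sum(dens)
    return [d / total for d in dens]


def cumulative(pmf):
    """Running sum of a pmf: a discrete CDF on the same support."""
    cdf, acc = [], 0.0
    for p in pmf:
        acc += p
        cdf.append(min(acc, 1.0))  # clip floating-point drift
    cdf[-1] = 1.0
    return cdf


def interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_left(xs, x)  # guarantees xs[j - 1] < x <= xs[j]
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + t * (ys[j] - ys[j - 1])


def design_repair(research_by_s, n_q=50, bandwidth=0.3):
    """Design a repair map from research data {s: [x, ...]} for one u stratum.

    Assumption: equal weights across the s groups (illustrative choice).
    """
    pooled = [x for xs in research_by_s.values() for x in xs]
    lo, hi = min(pooled), max(pooled)
    # Common interpolated support of n_q points shared by all s groups.
    grid = [lo + (hi - lo) * k / (n_q - 1) for k in range(n_q)]
    cdfs = {
        s: cumulative(kde_pmf(xs, grid, bandwidth))
        for s, xs in research_by_s.items()
    }
    weight = 1.0 / len(cdfs)

    def repair(x, s):
        # Rank x within its own s-conditional marginal ...
        p = interp(x, grid, cdfs[s])
        # ... then read off the barycentre quantile (average of quantiles).
        return sum(weight * interp(p, cdf, grid) for cdf in cdfs.values())

    return repair


if __name__ == "__main__":
    random.seed(0)
    research = {
        0: [random.gauss(0.0, 1.0) for _ in range(300)],
        1: [random.gauss(2.0, 1.0) for _ in range(300)],
    }
    repair = design_repair(research)
    # Off-sample point with s = 1, repaired using the research-designed map.
    print(repair(2.5, 1))
```

Because the map is designed once on the small research set and then evaluated point by point, it can be applied sequentially to archival data of any size, provided the archival data follow the same distribution as the research data (the stationarity assumption).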

Paper Structure (21 sections, 21 equations, 4 figures, 2 tables, 2 algorithms)

Figures (4)

  • Figure 1: Graphical representation of unfair data under the proposed $S$, $U$, $X$, $\hat{Y}$ model. Nodes in grey are unobserved, or may be unobserved.
  • Figure 2: Graphical representation of fair data under the proposed $S$, $U$, $X$, $\hat{Y}$ model, where the link between $S$ and the fairness-repaired data $X'$ is mediated by $U$.
  • Figure 3: Simulated bivariate Gaussian sub-groups (Section V-A). Empirical approximation of $E$ (Equation \ref{eq:hat-kld-u}) as the size of the research data set, $n_R$, increases. For this experiment, $n_A = 5000$ and $n_Q = 50$.
  • Figure 4: Simulated bivariate Gaussian sub-groups (Section V-A). Empirical approximation of $E$ (Equation \ref{eq:hat-kld-u}) as $n_Q$ increases for the composite repaired data set. For this experiment, $n_R = 500$ and $n_A = 5000$.

Theorems & Definitions (4)

  • Definition 2.1: $u$-conditional fairness
  • Definition 2.2: Disparate Treatment
  • Definition 2.3: Disparate Impact
  • Definition 2.4: $s|u$-dependence metric