Fair Data Pre-Processing with Imperfect Attribute Space

Ying Zheng, Yangfan Jiang, Kian-Lee Tan

Abstract

Fair data pre-processing is a widely used strategy for mitigating bias in machine learning. A promising line of research focuses on calibrating datasets to satisfy a designed fairness policy so that sensitive attributes influence outcomes only through clearly specified legitimate causal pathways. While effective on clean and information-rich data, these methods often break down in real-world scenarios with imperfect attribute spaces, where decision-relevant factors may be deemed unusable or even missing. To address this gap, we propose LatentPre, a novel framework that enables principled and robust fair data processing in practical settings. Instead of relying solely on observed attributes, LatentPre augments the fairness policy with latent attributes that capture essential but subtle signals, enabling the framework to operate as if the attribute space were perfect. These latent attributes are strategically introduced to guarantee identifiability and are estimated using a tailored expectation-maximization paradigm. The raw data is then carefully refined to conform to this latent-augmented policy, effectively removing biased patterns while preserving justifiable ones. Extensive experiments demonstrate that LatentPre consistently achieves strong fairness-utility trade-offs across diverse scenarios, advancing practical fairness-aware data management.

Paper Structure

This paper contains 36 sections, 3 theorems, 32 equations, 12 figures, 2 tables, and 4 algorithms.

Key Result

Corollary 1

Let $\mathcal{G}$ be the attribute graph derived from a dataset. If all causal pathways from sensitive attributes to the label pass through at least one admissible attribute, then any reasonable classifier trained on this dataset is regarded as justifiably fair.
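The graph condition in Corollary 1 can be read as a reachability check. The following is a minimal sketch of that reading, not the paper's algorithm. It assumes the attribute graph is given as a directed adjacency mapping whose edges are causal links, and that the sensitive and admissible sets are disjoint. The function name `all_paths_blocked` and the attribute names in the example are hypothetical.

```python
from collections import deque


def all_paths_blocked(graph, sensitive, admissible, label):
    """Return True if every directed path from a sensitive attribute to
    `label` passes through at least one admissible attribute.

    Assumption: `graph` maps each node to an iterable of its children,
    and `sensitive` and `admissible` are disjoint.
    """
    blocked = set(admissible)
    for source in sensitive:
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for child in graph.get(node, ()):
                if child == label:
                    # Reached the label without crossing an admissible node.
                    return False
                if child in blocked or child in seen:
                    # Do not search past admissible attributes.
                    continue
                seen.add(child)
                queue.append(child)
    return True


# Hypothetical admissions example: two pathways from "gender" to "admit".
example_graph = {
    "gender": ["department", "hobby"],
    "department": ["admit"],
    "hobby": ["admit"],
}

only_department = all_paths_blocked(
    example_graph, {"gender"}, {"department"}, "admit"
)
both_mediators = all_paths_blocked(
    example_graph, {"gender"}, {"department", "hobby"}, "admit"
)
```

In this toy graph, marking only `department` as admissible leaves the pathway through `hobby` open, so the condition fails. Marking both mediators as admissible satisfies it. A direct edge from a sensitive attribute to the label always violates the condition, because no admissible attribute can lie on that pathway.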

Figures (12)

  • Figure 1: Attribute graphs of real-world examples.
  • Figure 2:
  • Figure 4: End-to-end performance under attribute ambiguity across three datasets, measured by AUC (utility) and ROD (fairness). Each box summarizes the 5-fold AUC values and the corresponding average ROD for one approach; higher AUC and lower ROD indicate better performance. Shaded regions indicate invalid results.
  • Figure 5: End-to-end performance under attribute absence across three datasets.
  • Figure 6: End-to-end performance under perfect attribute space across three datasets.
  • ...and 7 more figures

Theorems & Definitions (10)

  • Example 1: Attribute Ambiguity
  • Example 2: Attribute Absence
  • Definition 1: $\mathcal{K}$-fairness [salimi2019interventional]
  • Definition 2: Justifiable Fairness [salimi2019interventional]
  • Corollary 1: [salimi2019interventional]
  • Proposition 1: [salimi2019interventional, zheng2025causalpre]
  • Definition 3: Identifiability [allman2009identifiability]
  • Definition 4: Generic Identifiability [allman2009identifiability]
  • Definition 5: Generic Identifiability up to Label Swapping [allman2009identifiability]
  • Theorem 1