OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport

Alireza Pirhadi, Mohammad Hossein Moslemi, Alexander Cloninger, Mostafa Milani, Babak Salimi

TL;DR

OTClean enforces conditional independence (CI) constraints in data used for ML by learning a probabilistic data cleaner via optimal transport (OT). It formulates the exact repair as a Quadratically Constrained Linear Program (QCLP), then makes estimation scalable through a relaxed OT objective with entropic regularization and Sinkhorn iterations, yielding a convergent alternating algorithm (FastOTClean). The approach preserves data utility while enforcing CI, and it improves algorithmic fairness and data-cleaning performance over baselines on multiple real and semi-synthetic datasets. The work analyzes convergence, runtime, memory, and practical optimizations, and discusses applicability to streaming or high-dimensional CI constraints as well as potential extensions to continuous/relational data and multiple CIs.
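To make the CI constraint concrete, the sketch below measures how strongly a discrete joint distribution violates X ⊥ Y | Z using conditional mutual information, which is zero exactly when the constraint holds. This is a standard measure chosen here for illustration; this summary does not say which violation measure the paper uses, and the function name and the `p_xyz[x, y, z]` table layout are assumptions.

```python
import numpy as np

def conditional_mutual_information(p_xyz: np.ndarray) -> float:
    """Return I(X; Y | Z) in nats for a joint table indexed as p_xyz[x, y, z].

    Hypothetical helper for illustration; not taken from the paper.
    """
    p = p_xyz / p_xyz.sum()                    # normalize to a probability table
    p_z = p.sum(axis=(0, 1), keepdims=True)    # shape (1, 1, nz)
    p_xz = p.sum(axis=1, keepdims=True)        # shape (nx, 1, nz)
    p_yz = p.sum(axis=0, keepdims=True)        # shape (1, ny, nz)
    mask = p > 0                               # skip cells with zero mass
    ratio = (p * p_z)[mask] / (p_xz * p_yz)[mask]
    return float(np.sum(p[mask] * np.log(ratio)))

# A table built as P(z) P(x|z) P(y|z) satisfies the CI constraint by construction.
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.9, 0.2], [0.1, 0.8]])   # indexed [x, z]
p_y_given_z = np.array([[0.3, 0.7], [0.7, 0.3]])   # indexed [y, z]
ci_table = np.einsum("xz,yz,z->xyz", p_x_given_z, p_y_given_z, p_z)

# A table where X always equals Y violates the constraint for every z.
dep_table = np.zeros((2, 2, 2))
dep_table[0, 0, :] = 0.25
dep_table[1, 1, :] = 0.25

cmi_ci = conditional_mutual_information(ci_table)
cmi_dep = conditional_mutual_information(dep_table)
```

The CI condition is equivalent to P(x, y, z) P(z) = P(x, z) P(y, z) for all values, which is quadratic in the entries of the distribution; this is consistent with the exact repair being posed as a quadratically constrained program.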

Abstract

Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.
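As background for the Sinkhorn-inspired step mentioned in the abstract, here is a minimal sketch of the classic Sinkhorn matrix-scaling iteration for entropy-regularized OT between two fixed histograms. It is the textbook balanced version, not the paper's FastOTClean algorithm, which solves a relaxed problem with a CI constraint. The names `sinkhorn_plan`, `reg`, and `n_iters`, and the toy cost matrix, are assumptions made for illustration.

```python
import numpy as np

def sinkhorn_plan(p, q, cost, reg=0.5, n_iters=500):
    """Entropy-regularized OT plan between histograms p and q.

    Textbook Sinkhorn scaling; `reg` is the entropic regularization strength.
    """
    kernel = np.exp(-cost / reg)        # Gibbs kernel, strictly positive
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (kernel @ v)            # rescale rows toward marginal p
        v = q / (kernel.T @ u)          # rescale columns toward marginal q
    return u[:, None] * kernel * v[None, :]

def plan_entropy(plan):
    """Shannon entropy of a strictly positive plan."""
    return float(-np.sum(plan * np.log(plan)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
idx = np.arange(3)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)   # toy ground cost |i - j|

plan_sharp = sinkhorn_plan(p, q, cost, reg=0.5)
plan_smooth = sinkhorn_plan(p, q, cost, reg=5.0)
```

Both plans have marginals `p` and `q`, but the plan computed with the larger `reg` has higher entropy, meaning mass is spread more evenly across pairs. Each iteration costs only matrix-vector products, which is why Sinkhorn-style schemes scale better than solving the unregularized transport problem directly.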

Paper Structure (53 sections, 1 theorem, 12 equations, 17 figures, 3 tables, 2 algorithms)

Key Result

theorem 1

For the relaxed optimization problem (labeled eq:relax-opt-clean in the paper), the repair algorithm (labeled alg:repair in the paper) converges.

Figures (17)

  • Figure 1: The coefficient $1/\rho$ in regularized OT impacts the mapping between distributions $P$ and $Q$: higher coefficients (on the right) lead to smoother mappings and spread mass more evenly between $P$ and $Q$.
  • Figure 2: Graphical representation of the plan $\pi({\mathbf{v}}, {\mathbf{v}}')$ for ${D}_2$. Nodes represent elements in $\mathcal{V}$. Labeled red edges indicate joint probabilities $\pi({\mathbf{v}}, {\mathbf{v}}')$, while dashed directed edges depict the probabilistic mapping $\pi({\mathbf{v}} \mid {\mathbf{v}}')$. Only nodes and edges with non-zero probabilities are shown for clarity.
  • Figure 3: The QCLP for the paper's QCLP example (labeled ex:QCLP). The top left shows the transport plan defined by the decision variables. The top right gives the definitions of $\tilde{Q}$. The rest are the objective and constraints.
  • Figure 4: Comparison of OTClean's performance with the baselines showing higher AUC and lower ROD (bias)
  • Figure 5: Fairness metrics in OTClean, indicating lower biases (ROD, EO, and DP) compared to baseline methods
  • ...and 12 more figures
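
The Figure 2 caption distinguishes a joint plan from the probabilistic mapping obtained by conditioning it. A minimal sketch of that step follows, assuming rows index observed values and columns index repaired values; that orientation, the toy plan, and the helper names `plan_to_cleaner` and `repair` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def plan_to_cleaner(plan):
    """Row-normalize a joint plan into a conditional distribution.

    cleaner[i, j] = P(repaired = j | observed = i). Rows with zero mass stay zero.
    """
    plan = np.asarray(plan, dtype=float)
    row_mass = plan.sum(axis=1, keepdims=True)
    return np.divide(plan, row_mass, out=np.zeros_like(plan), where=row_mass > 0)

def repair(observed_idx, cleaner, rng):
    """Sample one repaired value index per observed value index.

    Assumes every observed index has non-zero mass in the plan.
    """
    n_values = cleaner.shape[1]
    return np.array([rng.choice(n_values, p=cleaner[i]) for i in observed_idx])

# Toy joint plan over three values; entries sum to 1.
plan = np.array([
    [0.3, 0.1, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.1, 0.3],
])
cleaner = plan_to_cleaner(plan)

rng = np.random.default_rng(0)
repaired = repair(np.array([1, 1, 0, 2]), cleaner, rng)
```

In this toy plan, value 1 is always mapped to itself, while values 0 and 2 are each moved to value 1 with probability 0.25. Sampling from the conditional rather than taking the most likely target is what makes the cleaner probabilistic.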

Theorems & Definitions (2)

  • definition 1: CI Data Cleaner
  • theorem 1