Using Synthetic Data to Mitigate Unfairness and Preserve Privacy in Collaborative Machine Learning

Chia-Yuan Wu, Frank E. Curtis, Daniel P. Robinson

TL;DR

This work addresses fairness and privacy in collaborative machine learning without iterative information exchange between the clients and the server. It introduces a two-stage synthetic-data framework. First, each client solves a bilevel optimization problem with covariance-based SP/EO fairness constraints to produce a fair first-stage synthetic dataset. Second, a synthetic-data generator with differential privacy guarantees (DP-CTGAN) produces a second-stage dataset for transmission. The server trains a global model on the pooled second-stage synthetic data with conventional techniques, so it does not need to handle fairness or privacy itself. Empirical results on the Law School dataset show that increasing the fairness penalty reduces the SPD and EOD fairness measures with only minor accuracy loss. The DP stage preserves privacy, and transmitting a smaller synthetic dataset still yields strong fairness and accuracy, while the single-transmission design reduces communication cost.

Abstract

In distributed computing environments, collaborative machine learning enables multiple clients to train a global model collaboratively. To preserve privacy in such settings, a common technique is to utilize frequent updates and transmissions of model parameters. However, this results in high communication costs between the clients and the server. To tackle unfairness concerns in distributed environments, client-specific information (e.g., local dataset size or data-related fairness metrics) must be sent to the server to compute algorithmic quantities (e.g., aggregation weights), which leads to a potential leakage of client information. To address these challenges, we propose a two-stage strategy that promotes fair predictions, prevents client-data leakage, and reduces communication costs in certain scenarios without the need to pass information between clients and server iteratively. In the first stage, for each client, we use its local dataset to obtain a synthetic dataset by solving a bilevel optimization problem that aims to ensure that the ultimate global model yields fair predictions. In the second stage, we apply a method with differential privacy guarantees to the synthetic dataset from the first stage to obtain a second synthetic dataset. We then pass each client's second-stage synthetic dataset to the server, the collection of which is used to train the server model using conventional machine learning techniques (that no longer need to take fairness metrics or privacy into account). Thus, we eliminate the need to handle fairness-specific aggregation weights while preserving client privacy. Our approach requires only a single communication between the clients and the server (thus making it communication cost-effective), maintains data privacy, and promotes fairness. We present empirical evidence to demonstrate the advantages of our approach.
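
The single-communication structure described in the abstract can be sketched roughly as follows. This is only an illustration of the data flow, not the authors' implementation: the names `placeholder_generator`, `client_payload`, and `server_pool` are hypothetical, the toy data is made up, and the placeholder generator merely resamples rows, so it provides neither fairness nor differential privacy. In the actual method, the first stage would be the bilevel optimization problem and the second stage a generator with differential privacy guarantees.

```python
import random
from typing import Callable, List, Tuple

# One record: (feature vector, sensitive attribute, label).
Row = Tuple[List[float], int, int]
Generator = Callable[[List[Row], int, random.Random], List[Row]]


def placeholder_generator(rows: List[Row], n_out: int, rng: random.Random) -> List[Row]:
    """Hypothetical stand-in for either synthetic-data stage.

    It only resamples rows with replacement, so it is NOT fair and NOT
    differentially private; a real stage would replace it.
    """
    return [rng.choice(rows) for _ in range(n_out)]


def client_payload(
    local_rows: List[Row],
    n_stage1: int,
    n_stage2: int,
    stage1: Generator,
    stage2: Generator,
    rng: random.Random,
) -> List[Row]:
    """Run both stages locally and return the only data the client sends."""
    syn1 = stage1(local_rows, n_stage1, rng)  # stage 1: fairness-oriented synthesis
    syn2 = stage2(syn1, n_stage2, rng)        # stage 2: privacy-oriented synthesis
    return syn2


def server_pool(payloads: List[List[Row]]) -> List[Row]:
    """Concatenate the client payloads into one training set for the server."""
    return [row for payload in payloads for row in payload]


rng = random.Random(0)
# Three toy clients with N = 50 local rows each (made-up data).
clients = [
    [([rng.random(), rng.random()], rng.randint(0, 1), rng.randint(0, 1)) for _ in range(50)]
    for _ in range(3)
]
# Each client transmits exactly once, here with N_s^1 = N and N_s^2 = 0.1 N.
payloads = [
    client_payload(rows, len(rows), len(rows) // 10,
                   placeholder_generator, placeholder_generator, rng)
    for rows in clients
]
pooled = server_pool(payloads)   # the server trains a conventional model on this
messages_sent = len(payloads)    # one message per client, no further rounds
```

The point of the sketch is that the server only ever sees `pooled`; no model parameters, dataset sizes, or fairness statistics travel back and forth between rounds.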

Paper Structure (15 sections, 18 equations, 4 figures, 2 tables, 1 algorithm)

Figures (4)

  • Figure 1: The above images illustrate our proposed approach for addressing fairness and privacy through collaborative machine learning. The images used in the illustration are from flaticon.
  • Figure 2: CML Method for Ensuring Fairness and Preserving Privacy
  • Figure 3: Performance trends of the accuracy, the covariance estimate (see \ref{equ:DBC}), and the absolute value of the SPD fairness measure (see \ref{equ:fairness-measures}) for different values of the penalty parameter $\rho_o$, computed using the client testing data on the Law School dataset. Each plot compares the following: the baseline, which corresponds to $\rho_o = 0$ so that fairness is completely ignored; syn_1(100%), the first-stage synthetic dataset with $N_s^1 = N$; syn_2(100%), the second-stage synthetic dataset with $N_s^2 = N$; and syn_2(10%), the second-stage synthetic dataset with $N_s^2 = 0.1N$.
  • Figure 4: Performance trends of the accuracy, the covariance estimate (see \ref{equ:DBC}), and the absolute value of the EOD fairness measure (see \ref{equ:fairness-measures}) for different values of the penalty parameter $\rho_o$, computed using the client testing data on the Law School dataset. Each plot compares the following: the baseline, which corresponds to $\rho_o = 0$ so that fairness is completely ignored; syn_1(100%), the first-stage synthetic dataset with $N_s^1 = N$; syn_2(100%), the second-stage synthetic dataset with $N_s^2 = N$; and syn_2(10%), the second-stage synthetic dataset with $N_s^2 = 0.1N$.
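
The quantities plotted in Figures 3 and 4 can be illustrated with a minimal sketch. The exact definitions are in the paper's equations, which are not reproduced here, so the code below rests on common conventions that should be read as assumptions: SPD is taken as the gap in positive-prediction rates between the two sensitive groups, EOD as the gap in true-positive rates (the equal opportunity difference), and the covariance estimate as the sample covariance between the sensitive attribute and the model's signed decision score. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def _rate(preds: Sequence[int]) -> float:
    """Fraction of positive predictions; NaN if the group is empty."""
    return sum(preds) / len(preds) if preds else float("nan")


def spd(y_pred: Sequence[int], s: Sequence[int]) -> float:
    """Assumed SPD: P(pred = 1 | s = 1) - P(pred = 1 | s = 0)."""
    g1 = [p for p, a in zip(y_pred, s) if a == 1]
    g0 = [p for p, a in zip(y_pred, s) if a == 0]
    return _rate(g1) - _rate(g0)


def eod(y_true: Sequence[int], y_pred: Sequence[int], s: Sequence[int]) -> float:
    """Assumed EOD: true-positive-rate gap between the two groups."""
    g1 = [p for p, y, a in zip(y_pred, y_true, s) if a == 1 and y == 1]
    g0 = [p for p, y, a in zip(y_pred, y_true, s) if a == 0 and y == 1]
    return _rate(g1) - _rate(g0)


def boundary_covariance(scores: Sequence[float], s: Sequence[int]) -> float:
    """Assumed covariance estimate: mean of (s_i - mean(s)) * score_i."""
    s_bar = sum(s) / len(s)
    return sum((a - s_bar) * d for a, d in zip(s, scores)) / len(s)


# Toy data (made up): eight samples, four per sensitive group.
s_attr = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 0, 1, 1, 0, 1, 0]
scores = [2.0, 1.0, -1.0, 1.0, -1.0, -2.0, 1.0, -1.0]  # signed decision scores
y_pred = [1 if d > 0 else 0 for d in scores]

abs_spd = abs(spd(y_pred, s_attr))
abs_eod = abs(eod(y_true, y_pred, s_attr))
cov_est = boundary_covariance(scores, s_attr)
```

Under these assumed definitions, a classifier whose decision scores are uncorrelated with the sensitive attribute gives a covariance estimate near zero, which is the intuition behind using a covariance term as a tractable proxy for the SPD and EOD measures.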