Bt-GAN: Generating Fair Synthetic Healthdata via Bias-transforming Generative Adversarial Networks

Resmi Ramachandranpillai, Md Fahim Sikder, David Bergström, Fredrik Heintz

TL;DR

Bt-GAN tackles fairness in synthetic healthcare data by transforming the Data Generation Process to be fair while preserving utility. It introduces a Bias-transforming DGP (Bt-DGP) built on a semi-supervised Triple GAN, augmented with mutual information de-biasing to suppress correlations with protected attributes and an LDSS-based density-preserving sampling scheme to maintain representation across sub-groups. The framework also employs Discriminator Rejection Sampling to correct sampling biases and uses SHAP for explainability. Empirical evaluations on MIMIC-III and fairness benchmarks show that Bt-GAN achieves state-of-the-art utility with substantially reduced bias amplification and near-zero fairness gaps across sub-groups, suggesting more reliable downstream healthcare predictions from synthetic data.

Abstract

Synthetic data generation offers a promising solution to enhance the usefulness of Electronic Healthcare Records (EHR) by generating realistic de-identified data. However, the existing literature primarily focuses on the quality of synthetic health data, neglecting the crucial aspect of fairness in downstream predictions. Consequently, models trained on synthetic EHR have faced criticism for producing biased outcomes in target tasks. These biases can arise from either (i) spurious correlations between features or (ii) the failure of models to accurately represent sub-groups. To address these concerns, we present Bias-transforming Generative Adversarial Networks (Bt-GAN), a GAN-based synthetic data generator specifically designed for the healthcare domain. To tackle spurious correlations (i), we propose an information-constrained Data Generation Process that enables the generator to learn a fair deterministic transformation based on a well-defined notion of algorithmic fairness. To overcome the challenge of capturing exact sub-group representations (ii), we incentivize the generator to preserve sub-group densities through score-based weighted sampling. This approach compels the generator to learn from underrepresented regions of the data manifold. We conduct extensive experiments using the MIMIC-III database. Our results demonstrate that Bt-GAN achieves SOTA accuracy while significantly improving fairness and minimizing bias amplification. We also perform an in-depth explainability analysis to provide additional evidence supporting the validity of our study. In conclusion, our research introduces a novel approach to addressing the limitations of synthetic data generation in the healthcare domain. By incorporating fairness considerations and leveraging advanced techniques such as GANs, we pave the way for more reliable and unbiased predictions in healthcare applications.
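
To make the idea of weighted sampling over sub-groups more concrete, the following is a minimal sketch, not the paper's method. It assumes a simple inverse-frequency weight as a stand-in for the score-based weights mentioned in the abstract, which this page does not define. The function names and the toy "major"/"minor" groups are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(groups):
    """Return one sampling weight per record, proportional to 1 / (size of its sub-group).

    Assumption: inverse frequency stands in for the paper's score-based weights.
    """
    counts = Counter(groups)
    raw = [1.0 / counts[g] for g in groups]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_batch(records, groups, batch_size, rng):
    """Draw a batch with replacement so each sub-group carries equal total probability."""
    weights = inverse_frequency_weights(groups)
    return rng.choices(records, weights=weights, k=batch_size)


# Hypothetical toy data: 90 records from a "major" group, 10 from a "minor" group.
groups = ["major"] * 90 + ["minor"] * 10
records = list(zip(range(100), groups))

rng = random.Random(0)
batch = weighted_batch(records, groups, batch_size=2000, rng=rng)

# Share of the batch drawn from the under-represented group.
minor_share = sum(1 for _, g in batch if g == "minor") / len(batch)
print(f"minor share in raw data: 0.10, in weighted batch: {minor_share:.2f}")
```

Under this weighting the under-represented group makes up roughly half of each batch instead of a tenth, which is one simple way a training loop could be pushed to see more samples from sparse regions of the data.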

Paper Structure

This paper contains 44 sections, 14 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: An example of partial recall for groups with underrepresentation (minor), overrepresentation (major), and adequate representation in data generation.
  • Figure 2: Architecture of Bt-GAN: the utilities of $\mathrm{C}_{\varphi}$, $\mathrm{G}_{\theta}$, $\mathrm{D}_{\zeta}$, and $\mathrm{D}_{\phi}$ are shown. The symbols $A$ and $R$ denote accept and reject, respectively. The discriminator $\mathrm{D}_{\phi}$ accepts a sample if it judges the sample to come from the true data distribution, denoted $p$.
  • Figure 3: Ablation study showing the effect of different values of $\alpha$ on test and train accuracy. When $\alpha=1$, the effect of MI reduction between $W$ and $S$ is large, but the accuracy also drops severely. When $\alpha=0.5$, the model performs comparatively better on the mortality prediction task. When $\alpha=0.0$, the MI reduction term in Equation 8 is inactive, and so is the fairness constraint. Accordingly, we set $\alpha=0.5$ to balance the quality-fairness trade-off for the entire process.
  • Figure 4: Data utility analysis
  • Figure 5: Representation of sub-groups (The LDSS scores are given in brackets)
  • ...and 5 more figures
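
The Figure 3 caption refers to reducing the mutual information (MI) between $W$ and $S$. As a rough illustration of the quantity involved, here is a minimal plug-in MI estimate for two discrete variables. It is a sketch under assumptions: treating $S$ as a binary protected attribute and $W$ as a discrete variable is a guess for illustration, the toy data is hypothetical, and the estimator used in the paper is not shown on this page. A counting estimate like this one is not differentiable, so it suits auditing data rather than training a generator.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: s plays the role of a binary protected attribute.
s = [0, 0, 1, 1] * 25
w_dependent = list(s)              # copies s exactly, so MI equals the entropy of s
w_independent = [0, 1, 0, 1] * 25  # every (s, w) pair is equally frequent, so MI is zero

print(mutual_information(s, w_dependent))
print(mutual_information(s, w_independent))
```

An MI of zero means the variable carries no information about the protected attribute, while larger values indicate stronger dependence.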