MedEqualizer: A Framework Investigating Bias in Synthetic Medical Data and Mitigation via Augmentation

Sama Salarian, Yue Zhang, Swati Padhee, Srinivasan Parthasarathy

TL;DR

This paper addresses fairness in synthetic healthcare data by evaluating representational bias across protected attributes when using GAN-based data generators. It introduces MedEqualizer, a model-agnostic augmentation framework that targets underrepresented subgroups before synthetic data generation, using conditional generation, two-stage filtering, and intersectional analysis. The approach builds on MedGAN, CTGAN, HealthGAN, and Chameleon components, and leverages the Logarithmic Disparity metric to quantify bias on MIMIC-III data, demonstrating improved subgroup balance after augmentation. The results show meaningful reductions in highly overrepresented and underrepresented subgroups and increases in equitably represented subgroups, suggesting that fairness-aware augmentation can yield more representative synthetic healthcare data suitable for downstream research and development.

Abstract

Synthetic healthcare data generation presents a viable approach to enhance data accessibility and support research by overcoming limitations associated with real-world medical datasets. However, ensuring fairness across protected attributes in synthetic data is critical to avoid biased or misleading results in clinical research and decision-making. In this study, we assess the fairness of synthetic data generated by multiple generative adversarial network (GAN)-based models using the MIMIC-III dataset, with a focus on representativeness across protected demographic attributes. We measure subgroup representation using the logarithmic disparity metric and observe significant imbalances, with many subgroups either underrepresented or overrepresented in the synthetic data, compared to the real data. To mitigate these disparities, we introduce MedEqualizer, a model-agnostic augmentation framework that enriches the underrepresented subgroups prior to synthetic data generation. Our results show that MedEqualizer significantly improves demographic balance in the resulting synthetic datasets, offering a viable path towards more equitable and representative healthcare data synthesis.
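As a rough illustration of the representativeness measurement described above, the sketch below computes a log disparity score per subgroup as the log odds ratio of the subgroup's share in the synthetic data versus the real data. This is one common formulation and is an assumption here; the paper's exact equation is not reproduced in this summary. The function names, the `tol` cutoff, and the three category labels are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def log_disparity(real_groups, synthetic_groups):
    """Map each subgroup seen in the real data to its log odds ratio.

    Assumed formulation: log(odds_synthetic / odds_real), where odds = p / (1 - p)
    and p is the subgroup's share of records. Positive means overrepresented in
    the synthetic data, negative means underrepresented.
    """
    real_counts = Counter(real_groups)
    syn_counts = Counter(synthetic_groups)
    n_real, n_syn = len(real_groups), len(synthetic_groups)

    scores = {}
    for group, real_count in real_counts.items():
        p_real = real_count / n_real
        p_syn = syn_counts.get(group, 0) / n_syn
        if p_real == 1.0 or p_syn == 1.0:
            scores[group] = math.nan  # odds are undefined for a 100% share
        elif p_syn == 0.0:
            scores[group] = -math.inf  # subgroup is missing from synthetic data
        else:
            odds_real = p_real / (1 - p_real)
            odds_syn = p_syn / (1 - p_syn)
            scores[group] = math.log(odds_syn / odds_real)
    return scores


def categorize(score, tol=math.log(1.25)):
    """Label a score using a hypothetical symmetric tolerance (not from the paper)."""
    if math.isnan(score):
        return "undefined"
    if score < -tol:
        return "underrepresented"
    if score > tol:
        return "overrepresented"
    return "equitable"


# Toy subgroup labels (not MIMIC-III data): each record is tagged with one subgroup.
real = ["A"] * 50 + ["B"] * 50
synthetic = ["A"] * 80 + ["B"] * 20
scores = log_disparity(real, synthetic)
labels = {group: categorize(score) for group, score in scores.items()}
```

In this toy case subgroup "A" grows from half of the real records to 80% of the synthetic ones, so its score is positive, while "B" receives the mirrored negative score. In practice each label would be an intersectional combination such as an (age, race, gender) tuple.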

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: MedEqualizer Workflow
  • Figure 2: Histograms showing the representation of all demographic subgroup combinations (age, race, and gender) in synthetic data compared to real data. For each model: (a) MedGAN, (b) HealthGAN, and (c) CTGAN. The left bars correspond to data generated without augmentation, and the right bars correspond to data augmented using MedEqualizer.
  • Figure 3: Subgroup representativeness in MIMIC-III synthetic data generated by MedGAN (a) before and (b) after data augmentation.
  • Figure 4: Subgroup representativeness in MIMIC-III synthetic data generated by HealthGAN (a) before and (b) after data augmentation.
  • Figure 5: Subgroup representativeness in MIMIC-III synthetic data generated by CTGAN (a) before and (b) after data augmentation.