MedEqualizer: A Framework Investigating Bias in Synthetic Medical Data and Mitigation via Augmentation
Sama Salarian, Yue Zhang, Swati Padhee, Srinivasan Parthasarathy
TL;DR
This paper addresses fairness in synthetic healthcare data by evaluating representational bias across protected attributes in the output of GAN-based data generators. It introduces MedEqualizer, a model-agnostic augmentation framework that enriches underrepresented subgroups before synthetic data generation, using conditional generation, two-stage filtering, and intersectional analysis. The approach builds on MedGAN, CTGAN, HealthGAN, and Chameleon components and uses the logarithmic disparity metric to quantify bias on MIMIC-III data, demonstrating improved subgroup balance after augmentation. The results show meaningfully fewer highly overrepresented and underrepresented subgroups and more equitably represented ones, suggesting that fairness-aware augmentation can yield more representative synthetic healthcare data for downstream research and development.
Abstract
Synthetic healthcare data generation offers a viable way to improve data accessibility and support research by overcoming limitations associated with real-world medical datasets. However, ensuring fairness across protected attributes in synthetic data is critical to avoid biased or misleading results in clinical research and decision-making. In this study, we assess the fairness of synthetic data generated by multiple generative adversarial network (GAN)-based models using the MIMIC-III dataset, with a focus on representativeness across protected demographic attributes. We measure subgroup representation using the logarithmic disparity metric and observe significant imbalances, with many subgroups either underrepresented or overrepresented in the synthetic data relative to the real data. To mitigate these disparities, we introduce MedEqualizer, a model-agnostic augmentation framework that enriches underrepresented subgroups prior to synthetic data generation. Our results show that MedEqualizer significantly improves demographic balance in the resulting synthetic datasets, offering a viable path towards more equitable and representative healthcare data synthesis.
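To make the measurement step concrete, the following is a minimal sketch of how subgroup representation might be compared between real and synthetic data. It assumes that logarithmic disparity is the log of the ratio between a subgroup's odds in the synthetic data and its odds in the real data; the abstract does not state the formula, so this definition is an assumption. The function names, the toy records, the attribute names, and the naive duplication step are all hypothetical and only stand in for MedEqualizer's actual augmentation, which is not specified here.

```python
import math
from collections import Counter


def subgroup_proportions(records, attrs):
    """Share of records that fall in each intersectional subgroup."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    total = len(records)
    return {group: n / total for group, n in counts.items()}


def log_disparity(p_synth, p_real, eps=1e-6):
    """Assumed definition: log odds ratio of synthetic vs. real share."""
    # Clip so that the odds stay finite for empty or all-covering subgroups.
    p_synth = min(max(p_synth, eps), 1 - eps)
    p_real = min(max(p_real, eps), 1 - eps)
    odds_synth = p_synth / (1 - p_synth)
    odds_real = p_real / (1 - p_real)
    return math.log(odds_synth / odds_real)


def disparities(real, synth, attrs):
    """Log disparity for every subgroup that occurs in the real data."""
    p_real = subgroup_proportions(real, attrs)
    p_synth = subgroup_proportions(synth, attrs)
    return {g: log_disparity(p_synth.get(g, 0.0), p) for g, p in p_real.items()}


def augment_underrepresented(real, under_groups, attrs, factor=2):
    """Naive stand-in for augmentation: duplicate records of flagged subgroups."""
    extra = [
        dict(r)
        for r in real
        if tuple(r[a] for a in attrs) in under_groups
        for _ in range(factor - 1)
    ]
    return real + extra


# Toy data (hypothetical): two protected attributes, two subgroups.
attrs = ("gender", "ethnicity")
real = [{"gender": "M", "ethnicity": "A"}] * 6 + [{"gender": "F", "ethnicity": "B"}] * 4
synth = [{"gender": "M", "ethnicity": "A"}] * 8 + [{"gender": "F", "ethnicity": "B"}] * 2

d = disparities(real, synth, attrs)
under = {g for g, value in d.items() if value < 0}  # negative: underrepresented
augmented = augment_underrepresented(real, under, attrs, factor=2)
```

Under this assumed definition, a value near zero means the subgroup holds a similar share in both datasets, a negative value means it is underrepresented in the synthetic data, and a positive value means it is overrepresented. In the framework described above, the augmented real data would then be passed to the generator in place of the original data.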
