FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples

Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani

TL;DR

This work tackles the challenge of deriving fair, high-utility synthetic health data from limited samples to enable responsible analytics under privacy constraints. It introduces FairTabGen, an LLM-driven, prompt-based framework that curates seeds, enforces utility and fairness constraints, and is evaluated with the Train Synthetic, Test Real (TSTR) protocol on the MIMIC-IV dataset. The methodology emphasizes a modular pipeline (data processor, generator, evaluator, bias mitigator) and uses metrics such as FTU, DP, ABROCA, TPRD, and ERD to quantify fairness. Empirical results show substantial data efficiency and improved fairness, especially after pre-processing bias mitigation. The work demonstrates practical potential for fair, privacy-preserving healthcare research while acknowledging limitations such as demographic distribution skew and reproducibility challenges with black-box LLMs, and it outlines directions toward open-source alternatives and larger-scale validation.

Abstract

Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge of training generative models and substantial computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation, and the embedding of structural constraints for data synthesis. We evaluate performance on the MIMIC-IV dataset. Our method uses 99% less data and achieves a 50% improvement in fairness through unawareness while maintaining competitive predictive utility. However, we observe that the distribution of racial groups is skewed, which affects demographic parity. We therefore apply bias mitigation algorithms in the pre-processing stage, improving overall fairness by 10% and highlighting the effectiveness of our approach.
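
To make the evaluation terms concrete, here is a minimal sketch of the TSTR protocol with two group-fairness gaps (demographic parity and true positive rate difference) computed on the real test rows. It is an illustration under stated assumptions, not the authors' pipeline. The 1-nearest-neighbour classifier, the toy data, and all function names are hypothetical. The max-minus-min form of each gap is one common textbook definition and may differ from the definitions used in the paper.

```python
from collections import defaultdict

def predict_1nn(train_X, train_y, x):
    """Return the label of the closest training row (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Train Synthetic, Test Real: fit on synthetic rows, score on real rows."""
    preds = [predict_1nn(synth_X, synth_y, x) for x in real_X]
    acc = sum(p == t for p, t in zip(preds, real_y)) / len(real_y)
    return acc, preds

def positive_rate_by_group(y_pred, groups):
    """Fraction of positive predictions within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, g in zip(y_pred, groups):
        counts[g][0] += int(pred == 1)
        counts[g][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(y_pred, groups)
    return max(rates.values()) - min(rates.values())

def true_positive_rate_difference(y_true, y_pred, groups):
    """Largest gap in TPR between groups, computed on actual positives only."""
    pos_pred = [p for t, p in zip(y_true, y_pred) if t == 1]
    pos_groups = [g for t, g in zip(y_true, groups) if t == 1]
    rates = positive_rate_by_group(pos_pred, pos_groups)
    return max(rates.values()) - min(rates.values())

# Toy data (hypothetical): one numeric feature, binary label, two groups.
synth_X = [[0.0], [1.0], [4.0], [5.0]]
synth_y = [0, 0, 1, 1]
real_X = [[0.2], [0.9], [3.9], [4.8], [1.1], [2.8], [4.4], [0.4]]
real_y = [0, 0, 1, 1, 0, 0, 1, 1]
groups = ["A", "B", "A", "A", "B", "B", "A", "B"]

acc, preds = tstr_accuracy(synth_X, synth_y, real_X, real_y)
dp_gap = demographic_parity_difference(preds, groups)
tpr_gap = true_positive_rate_difference(real_y, preds, groups)
print(f"TSTR accuracy={acc:.2f}, DP gap={dp_gap:.2f}, TPR gap={tpr_gap:.2f}")
```

In this sketch the fairness gaps are measured on predictions for real data, so a skewed group distribution in the training rows shows up directly as a larger gap.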

Paper Structure

This paper contains 25 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Proposed synthetic data generation architecture. Data is first curated in the data processor, then supplied to the generator which gives the desired output. The evaluator evaluates performance and fairness. Finally, the bias mitigator module runs multiple techniques which are evaluated based on closeness to original value.
  • Figure 2: Distributional Drift Analysis. Bars represent the signed deviation from the real-world baseline ($\Delta = \text{Method} - \text{Real}$). Subplot (d) illustrates the Composite Fairness Deviation, where FairTabGen consistently maintains the highest stability (closest to zero) across all mitigation strategies compared to state-of-the-art baselines.
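
The signed deviation in the Figure 2 caption, $\Delta = \text{Method} - \text{Real}$, can be computed per metric as in the short sketch below. The metric values are hypothetical placeholders, not results from the paper. The caption does not define the Composite Fairness Deviation, so it is not reproduced here.

```python
def signed_deviation(method_metrics, real_metrics):
    """Per-metric signed deviation from the real-data baseline: Delta = Method - Real."""
    return {name: method_metrics[name] - real_metrics[name] for name in real_metrics}

# Hypothetical values for illustration only; not taken from the paper.
real = {"DP": 0.10, "TPRD": 0.05}
method = {"DP": 0.12, "TPRD": 0.03}

delta = signed_deviation(method, real)
for name, value in delta.items():
    # A value near zero means the synthetic data tracks the real baseline closely.
    print(f"{name}: {value:+.3f}")
```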