FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples

Nitish Nagesh; Salar Shakibhamedan; Mahdi Bagheri; Ziyu Wang; Nima TaheriNejad; Axel Jantsch; Amir M. Rahmani

TL;DR

This work tackles the challenge of deriving fair, high-utility synthetic health data from limited samples to enable responsible analytics under privacy constraints. It introduces FairTabGen, an LLM-driven, prompt-based framework that curates seeds, enforces utility and fairness constraints, and is evaluated with Train Synthetic, Test Real (TSTR) on the MIMIC-IV dataset. The methodology uses a modular pipeline (data processor, generator, evaluator, bias mitigator) and quantifies fairness with metrics such as fairness through unawareness (FTU), demographic parity (DP), ABROCA, TPRD, and ERD. Empirical results show substantial data efficiency and improved fairness, especially after pre-processing bias mitigation. The work demonstrates practical potential for fair, privacy-preserving healthcare research while acknowledging limitations such as skewed demographic distributions and reproducibility challenges with black-box LLMs, and it outlines directions toward open-source alternatives and larger-scale validation.

Abstract

Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge of training generative models and substantial computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation, and embedded structural constraints for data synthesis. We evaluate performance on the MIMIC-IV dataset. Our method uses 99% less data and achieves a 50% improvement in fairness through unawareness while maintaining competitive predictive utility. However, we observe that the distribution of racial groups in the data is skewed, which affects demographic parity. We therefore apply bias mitigation algorithms in the pre-processing stage, improving overall fairness by 10% and highlighting the effectiveness of our approach.
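
To make the group fairness notions mentioned above concrete, the following is a minimal sketch of how a demographic parity gap and a true positive rate gap could be computed from a classifier's predictions. It assumes the common "largest minus smallest group rate" definitions, which may differ from the exact formulations used in the paper. The function names (`group_rates`, `max_gap`) and the toy labels are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group positive-prediction rate and true positive rate (binary labels)."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0})
    for yt, yp, g in zip(y_true, y_pred, groups):
        s = stats[g]
        s["n"] += 1            # group size
        s["pred_pos"] += yp    # predicted positives
        s["pos"] += yt         # actual positives
        s["tp"] += yt * yp     # true positives
    rates = {}
    for g, s in stats.items():
        rates[g] = {
            "positive_rate": s["pred_pos"] / s["n"],
            # TPR is undefined for a group with no actual positives
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
    return rates


def max_gap(rates, key):
    """Largest minus smallest group value for the given rate, ignoring NaN."""
    values = [r[key] for r in rates.values() if r[key] == r[key]]
    return max(values) - min(values)


# Toy example (fabricated labels for illustration only, not MIMIC-IV data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
dp_diff = max_gap(rates, "positive_rate")  # demographic parity gap
tpr_diff = max_gap(rates, "tpr")           # true positive rate gap
print(f"DP gap: {dp_diff:.3f}, TPR gap: {tpr_diff:.3f}")
```

In a Train Synthetic, Test Real setting, `y_pred` would come from a model trained on synthetic records and scored on held-out real records. A gap of zero means the groups receive positive predictions (or are correctly identified) at the same rate.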
