Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study

Daniel Smolyak, Arshana Welivita, Margrét V. Bjarnadóttir, Ritu Agarwal

TL;DR

This study tackles health equity in predictive modeling by proposing a GPT4-Turbo-based pipeline that generates group-specific synthetic data for underrepresented demographic subgroups. It uses two public datasets, MIMIC-IV and Framingham OS/OMNI-1, and compares group-specific prompts, generic prompts, and standard augmentation baselines, assessing downstream predictive performance with AUROC and AUPRC. Results show that synthetic data augmentation often improves performance for smaller groups, but gains from tailoring prompts to a specific subgroup are mixed and often small, which positions the method as a complementary tool rather than a universal solution. The work also underscores risks of group-agnostic prompting and points to future directions such as retrieval-augmented generation and health-domain LLMs to better capture subgroup nuances in non-representative medical data.

Abstract

Objective. Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution to this problem is to generate synthetic data to mitigate potential adverse effects of non-representative datasets. Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline in which the synthetic data is generated separately for each demographic group. We conduct our study using the MIMIC-IV and Framingham "Offspring and OMNI-1 Cohorts" datasets. We prompt GPT4-Turbo to create group-specific data, providing training examples and the dataset context. An exploratory analysis is conducted to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmenting a training dataset in a downstream machine learning task, focusing specifically on model performance metrics across groups. Results. The performance of GPT4-Turbo augmentation is generally, though not always, superior. In the majority of experiments our method outperforms standard modeling baselines; however, prompting GPT4-Turbo to produce data specific to a group provides little to no additional benefit over a prompt that does not specify the group. Conclusion. We developed a method for using LLMs out-of-the-box to synthesize group-specific data to address imbalances in demographic representation in medical datasets. As another "tool in the toolbox", this method can improve model fairness and thus health equity. More research is needed to understand the conditions under which LLM-generated synthetic data is useful for non-representative medical datasets.
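
The group-level evaluation described above can be illustrated with a minimal sketch. It assumes binary labels (0/1), model scores, and a group tag per test record, and computes AUROC separately for each group using the rank-sum (Mann-Whitney U) formulation. The function names `auroc` and `auroc_by_group` and the toy data are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC via the rank-sum formulation; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUROC is undefined when only one class is present

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of records that share the same score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


def auroc_by_group(labels, scores, groups):
    """Split test records by group tag and compute AUROC within each group."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in buckets.items()}


# Toy data (illustrative only): two groups with four test records each.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
per_group = auroc_by_group(toy_labels, toy_scores, toy_groups)
```

Reporting the metric per group, rather than pooled over the whole test set, is what makes a performance gap between better- and less-represented groups visible.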

Paper Structure

This paper contains 17 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Pipeline overview. First, data from the designated majority group and minority group are sampled. Next, either the minority group data is augmented with GPT4T-generated synthetic data (our approach), one of the baseline approaches is applied to the data, or no change is made. The groups are then combined and used to train a logistic regression model. The performance of the model is evaluated for each group on the held-out testing data.
  • Figure 2: (left) Kernel density estimation for the joint distribution of diastolic and systolic blood pressure in (left top) the Black-tailored GPT4T-generated data versus the real MIMIC-IV data for Black patients and (left bottom) the Hispanic-tailored GPT4T-generated data versus the real Framingham data for Hispanic participants. (top center) The correlation plot for previous hospital, emergency department, and ICU admissions in the original MIMIC-IV data and (top right) the GPT4T-generated data. (bottom center) The correlation plot for several heart-health-related variables in the original Framingham data and (bottom right) the GPT4T-generated data.
  • Figure 3: (left) MIMIC-IV: Distribution of the distances between each GPT4T-generated data point (with a Black-specific prompt) and the nearest sample of real Black patients and real White patients. (right) Framingham: Distribution of the distances between each GPT4T-generated data point (with a Hispanic-specific prompt) and the nearest sample of real Hispanic patients and real White patients.
  • Figure 4: Density of predicted probabilities for belonging to the minority racial group for the synthetic data (blue), each minority group (orange), and the majority group (green) in the MIMIC-IV data (left) and the Framingham data (right).
  • Figure :