Zero-shot generation of synthetic neurosurgical data with large language models

Austin A. Barr, Eddie Guo, Emre Sezgin

TL;DR

This work tackles limited access to real-world neurosurgical data by evaluating zero-shot synthetic data generation with GPT-4o and benchmarking it against CTGAN. Using a small, real-world Milan neurosurgical dataset, the authors generate synthetic data without fine-tuning or pre-training on the real data, amplify sample sizes, and assess fidelity, utility, and privacy via SDMetrics and a train-on-synthetic, test-on-real (TSTR) classifier. GPT-4o generally matches or surpasses CTGAN on fidelity metrics and delivers comparable ML utility, without reproducing any exact rows from the real data. The results support the potential of LLM-based synthetic data to augment small clinical datasets for outcome prediction, though further validation across more features and settings is needed to confirm distributional fidelity and predictive performance.

Abstract

Clinical data is fundamental to advancing neurosurgical research, but access is often constrained by data availability, small sample sizes, privacy regulations, and resource-intensive preprocessing and de-identification procedures. Synthetic data offers a potential solution to challenges associated with accessing and using real-world data (RWD). This study evaluates zero-shot generation of synthetic neurosurgical data with a large language model (LLM), GPT-4o, benchmarking it against the conditional tabular generative adversarial network (CTGAN). Synthetic datasets were compared to real-world neurosurgical data to assess fidelity (means, proportions, distributions, and bivariate correlations), utility (ML classifier performance on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated datasets matched or exceeded CTGAN performance, despite no fine-tuning or access to RWD for pre-training. Datasets demonstrated high univariate and bivariate fidelity to RWD without directly exposing any real patient records, even at amplified sample size. For a binary prediction task (postoperative functional status deterioration), an ML classifier trained on GPT-4o-generated data and tested on RWD achieved an F1 score of 0.706, comparable to the 0.705 obtained when training on the CTGAN data. GPT-4o demonstrated a promising ability to generate high-fidelity synthetic neurosurgical data. These findings also indicate that data synthesized with GPT-4o can effectively augment small clinical datasets and be used to train ML models for prediction of neurosurgical outcomes. Further investigation is necessary to improve the preservation of distributional characteristics and boost classifier performance.

Paper Structure

This paper contains 16 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Prompt entered into GPT-4o to generate synthetic datasets. The prompt was submitted in 10 independent trials to generate synthetic datasets (n = 139). To generate an amplified dataset, the prompt was modified from “139 patients” to “1390 patients”, with all other aspects of the prompt remaining identical.
  • Figure 2: 95% confidence interval (CI) overlap of continuous parameters between real and synthetic datasets. Error bars display 95% CIs for age, postoperative length of stay, and body mass index for the real-world data (RWD), the best (BG) and worst (WG) generations among the GPT-4o (n = 139) datasets, the CTGAN (n = 139) dataset, and the ten-fold amplified datasets (n = 1390) generated with GPT-4o and CTGAN (10x). The GPT-4o (n = 139) datasets were ranked by 95% CI overlap percentage to determine BG and WG. The 95% CI overlap percentages were not included for the 10x datasets because the larger sample size yields narrower 95% CIs.
  • Figure 3: Proportional alignment of ordinal and binary parameters between real and synthetic datasets. Stacked bar plots display proportions of sex, Karnofsky performance status (KPS) deterioration at discharge, heart disease, previous brain radiotherapy, Milan complexity scale, American Society of Anesthesiologists (ASA) physical status classification, diabetes, and Landriel scale for the real-world data (RWD), best (BG) and worst (WG) generations from GPT-4o (n = 139) datasets, CTGAN (n = 139) dataset, and ten-fold amplified datasets (n = 1390) generated with GPT-4o and CTGAN (10x). The GPT-4o (n = 139) datasets were ranked in order of TVComplement score to determine BG and WG.
  • Figure 4: Confusion matrices for the binary classification task trained on amplified synthetic data and tested on real data. Binary classification used the BinaryAdaBoostClassifier to predict Karnofsky performance status (KPS) deterioration at discharge. The confusion matrices display performance of the (left) model trained on the GPT-4o (n = 1390) dataset and (right) model trained on the CTGAN (n = 1390) dataset.
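
The figure captions above rely on three quantities: the TVComplement score used to rank categorical columns, the 95% CI overlap used to rank continuous columns, and the F1 score that summarizes a confusion matrix. The sketch below shows how each could be computed with the Python standard library. It is an illustration written from scratch, not the SDMetrics implementation or the authors' code. TVComplement is taken here as one minus the total variation distance between category frequencies. The CI overlap formula (the intersection averaged as a fraction of each interval) and the normal-approximation interval for the mean are assumptions, since the captions do not state which definition was used. All function names and the toy values are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance between two categorical columns."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for a category that is missing from one column.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


def mean_ci95(values):
    """Normal-approximation 95% CI for the mean (an assumed choice of interval)."""
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    centre = mean(values)
    return centre - half_width, centre + half_width


def ci_overlap(real, synthetic):
    """Average fraction of each 95% CI covered by their intersection (assumed definition)."""
    r_lo, r_hi = mean_ci95(real)
    s_lo, s_hi = mean_ci95(synthetic)
    shared = min(r_hi, s_hi) - max(r_lo, s_lo)
    if shared <= 0:
        return 0.0  # disjoint (or zero-width) intervals count as no overlap
    return 0.5 * (shared / (r_hi - r_lo) + shared / (s_hi - s_lo))


def f1_from_confusion(tp, fp, fn):
    """F1 score from confusion-matrix counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy values, invented purely for illustration (not taken from the paper).
toy_real_sex = ["F", "F", "M", "M"]
toy_synth_sex = ["F", "M", "M", "M"]
toy_ages = [52.0, 61.0, 47.0, 70.0, 58.0]

print(tv_complement(toy_real_sex, toy_synth_sex))  # 0.75
print(ci_overlap(toy_ages, toy_ages))              # 1.0 (identical columns)
print(f1_from_confusion(tp=8, fp=2, fn=2))         # 0.8
```

Under this definition the overlap shrinks as intervals narrow, which is consistent with the Figure 2 note that overlap percentages are not reported for the amplified datasets.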