Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation

Austin A. Barr, Robert Rozman, Eddie Guo

TL;DR

The paper addresses the challenge of generating high-quality synthetic tabular data without task-specific fine-tuning or access to real-world data for pre-training. It introduces a zero-shot framework that prompts the GPT-4o LLM in plain language and benchmarks it against CTGAN on three open-access datasets (Iris, Fish Measurements, and Real Estate Valuation), evaluating fidelity and privacy with SDMetrics. GPT-4o preserves means, 95% confidence intervals, bivariate correlations, and data privacy better than CTGAN, even at amplified sample sizes, while retention of distributional characteristics varies by dataset and needs refinement. The approach offers a scalable, accessible alternative to GAN-based synthesis that could augment data and support ML training; future work should improve distributional fidelity and assess downstream utility.

Abstract

We propose a new framework for zero-shot generation of synthetic tabular data. Using the large language model (LLM) GPT-4o and plain-language prompting, we demonstrate the ability to generate high-fidelity tabular data without task-specific fine-tuning or access to real-world data (RWD) for pre-training. To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated synthetic data against data generated with the conditional tabular generative adversarial network (CTGAN), across three open-access datasets: Iris, Fish Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o outperformed CTGAN in preserving means, 95% confidence intervals, bivariate correlations, and data privacy of RWD, even at amplified sample sizes. Notably, correlations between parameters were consistently preserved with appropriate direction and strength. However, refinement is necessary to better retain distributional characteristics. These findings highlight the potential of LLMs in tabular data synthesis, offering an accessible alternative to generative adversarial networks and variational autoencoders.
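
To make the "means and 95% confidence intervals" comparison concrete, here is a minimal sketch of how one column of a real dataset could be compared with its synthetic counterpart. It is an illustration only: the function names `mean_ci` and `ci_overlap`, the normal-approximation interval, and the toy values are all assumptions, and the abstract does not state how the authors computed their intervals.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return (mean, lower, upper) for a column.

    Assumption: a normal-approximation interval, mean +/- z * s / sqrt(n).
    The paper may have used a different construction (e.g. a t-interval).
    """
    n = len(values)
    m = mean(values)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    half_width = z * stdev(values) / sqrt(n)
    return m, m - half_width, m + half_width


def ci_overlap(real, synthetic, level=0.95):
    """True if the two confidence intervals share at least one point."""
    _, r_lo, r_hi = mean_ci(real, level)
    _, s_lo, s_hi = mean_ci(synthetic, level)
    return max(r_lo, s_lo) <= min(r_hi, s_hi)


# Toy values for illustration only; these are not taken from the paper.
real = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0]
synthetic = [5.0, 5.2, 4.8, 4.7, 5.1, 5.3, 4.5, 4.9]

for label, column in (("real", real), ("synthetic", synthetic)):
    m, lo, hi = mean_ci(column)
    print(f"{label}: mean={m:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
print("intervals overlap:", ci_overlap(real, synthetic))
```

Overlapping intervals are only a coarse check of agreement in the mean. They say nothing about the shape of the distribution, which is the aspect the abstract flags as needing refinement.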

Paper Structure

This paper contains 18 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Prompt used to generate synthetic Fish Measurements data (n = 159).
  • Figure 2: 95% confidence intervals compared between the real Iris data and the GPT-4o and CTGAN-generated datasets at the same (n = 150) and amplified (n = 1000) sample size.
  • Figure 3: Heatmap comparison of Pearson’s product moment correlations of all bivariate relationships for the (A) real Iris dataset, (B) GPT-4o synthetic (n = 150) dataset, (C) GPT-4o synthetic (n = 1000) dataset, (D) CTGAN synthetic (n = 150) dataset, (E) CTGAN synthetic (n = 1000) dataset.
  • Figure 4: 95% confidence intervals compared between the real Fish Measurements data and the GPT-4o and CTGAN-generated datasets at the same (n = 159) and amplified (n = 1000) sample size.
  • Figure 5: Heatmap comparison of Pearson’s product moment correlations of all bivariate relationships for the (A) real Fish Measurements dataset, (B) GPT-4o synthetic (n = 159) dataset, (C) GPT-4o synthetic (n = 1000) dataset, (D) CTGAN synthetic (n = 159) dataset, (E) CTGAN synthetic (n = 1000) dataset.
  • ...and 2 more figures
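
The correlation heatmaps listed above compare Pearson correlations of every pair of columns across the real and synthetic datasets. A minimal sketch of that kind of comparison follows, using only the standard library. The helper names (`pearson`, `correlation_matrix`, `max_abs_corr_diff`), the column names, and the toy values are hypothetical, and the single summary number is an illustrative choice, not a metric reported by the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each (column_a, column_b) pair to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def max_abs_corr_diff(real, synthetic):
    """Largest absolute gap between matching correlation entries."""
    r = correlation_matrix(real)
    s = correlation_matrix(synthetic)
    return max(abs(r[key] - s[key]) for key in r)


# Toy columns for illustration only; names and values are hypothetical.
real = {"length": [1.0, 2.0, 3.0, 4.0, 5.0], "weight": [2.1, 3.9, 6.2, 8.1, 9.8]}
synthetic = {"length": [1.2, 2.1, 2.9, 4.2, 4.8], "weight": [2.5, 4.0, 5.9, 8.4, 9.5]}

print("max |r_real - r_synth|:", round(max_abs_corr_diff(real, synthetic), 4))
```

A small maximum gap indicates that the synthetic data reproduces both the direction and the strength of the real pairwise relationships, which is the property the heatmaps are meant to show visually.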