Assessing Generative Models for Structured Data

Reilly Cannon, Nicolette M. Laird, Caesar Vazquez, Andy Lin, Amy Wagler, Tony Chiang

TL;DR

This work tackles the problem of directly evaluating synthetic tabular data by comparing real and synthetic inter-column dependencies across marginal, pairwise, and higher-order relationships. It introduces a distribution-focused framework based on cumulants, dependency networks, community structure, and higher-order statistics to assess synthetic data from GPT-2 (few-shot and fine-tuned) and CTGAN. Findings show that while marginal distributions can be well approximated, both LLMs and GANs struggle to reproduce pairwise and especially higher-order dependencies, underscoring limitations of current synthetic-data approaches. The framework provides a rigorous, task-agnostic tool to guide future improvements in synthetic tabular data generation with implications for privacy-sensitive domains.

Abstract

Synthetic tabular data generation has emerged as a promising method to address limited data availability and privacy concerns. With the sharp increase in the performance of large language models in recent years, researchers have been interested in applying these models to the generation of tabular data. However, little is known about the quality of the generated tabular data from large language models. The predominant method for assessing the quality of synthetic tabular data is the train-synthetic-test-real approach, where the artificial examples are compared to the original by how well machine learning models, trained separately on the real and synthetic sets, perform in some downstream tasks. This method does not directly measure how closely the distribution of generated data approximates that of the original. This paper introduces rigorous methods for directly assessing synthetic tabular data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data. Results from this study can inform future practice in synthetic data generation to improve data quality.
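To make the idea of a direct, dependency-based comparison concrete, the following is a minimal sketch and not the authors' implementation. It assumes every column is numeric and uses the Pearson correlation as a stand-in for whatever association measure the paper uses. The function names (`pearson`, `dependency_matrix`, `dependency_difference`) are hypothetical.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def dependency_matrix(columns):
    """Pairwise association matrix for a table given as a list of columns."""
    p = len(columns)
    return [
        [1.0 if i == j else pearson(columns[i], columns[j]) for j in range(p)]
        for i in range(p)
    ]


def dependency_difference(real_columns, synthetic_columns):
    """Entrywise difference (real minus synthetic) of the two matrices."""
    real = dependency_matrix(real_columns)
    synth = dependency_matrix(synthetic_columns)
    p = len(real)
    return [[real[i][j] - synth[i][j] for j in range(p)] for i in range(p)]


# Toy example: the synthetic table keeps both marginals exactly
# but reverses the relationship between the two columns.
real = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
synthetic = [[1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]]
diff = dependency_difference(real, synthetic)
```

In the toy example the two tables have identical marginal distributions, yet the off-diagonal entries of `diff` are large. That is the kind of mismatch a marginal-only check would miss.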

Paper Structure

This paper contains 19 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Marginal distributions. Violin plots showing the marginal distribution of a column from real and synthetic data. a) Marginal distribution of the "hours-per-week" column from the Adult dataset. b) Marginal distribution of the "worst concave points" column from the Breast Cancer dataset. c) Marginal distribution of the "V23" column from the Credit dataset. There is a separate plot for each of the 15 trials conducted for CTGAN, fine-tuned GPT-2, and few-shot prompted GPT-2. The real data distribution comes from the training set. In general, the synthetic data produced by fine-tuned GPT-2 most closely matched the real data. Violin plots for the remaining continuous columns can be found in Supplemental Figs. \ref{fig:adult_violins}--\ref{fig:credit5_violins}.
  • Figure 2: Difference in association effect size between real and synthetic data. Heatmaps showing the difference in association between real and synthetic data created by (top left) resampling of the train set, (top right) CTGAN, (bottom left) fine-tuned GPT-2, and (bottom right) few-shot prompted GPT-2 for pairs of features in the Adult dataset. The heatmaps shown are for the best-quality data generated by each method over the $15$ trials, selected by the lowest absolute determinant of the dependency matrix. Heatmaps for the Breast Cancer and Credit datasets can be found in Supplementary Fig. \ref{fig:breast_heatmap} and \ref{fig:credit_heatmap}, respectively.
  • Figure 3: Representative network plots of the dependencies in real and synthetic data for the Adult dataset: (a) Adult train set, (b) resampling of the train set, (c) CTGAN, and (d) fine-tuned GPT-2. The node colors represent the clusters found using the Louvain community detection algorithm. The edge colors reflect the sign of the relationships (red for negative correlations, green for positive correlations). Displayed models were chosen according to the lowest absolute determinant of the dependency matrix.
  • Figure 4: Comparison of higher-order cumulants. Plots of the largest $100$ third- and fourth-order cumulants from different combinations of models and datasets. Each row of panels represents a different dataset, and each column represents a different higher-order joint cumulant. For the synthetic data, the dots represent the average value over the $15$ runs of generating synthetic data for each model, and the shaded region denotes the min-max spread. In general, synthetically produced data fail to reproduce the higher-order dependencies present in the real data.
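
The higher-order comparison in Figure 4 relies on joint cumulants. As a rough illustration, and not the authors' code, the sketch below computes third- and fourth-order joint cumulants using the standard identities for centered variables: the third-order joint cumulant equals the third central moment, and the fourth-order joint cumulant equals the fourth central moment minus the three products of pairwise covariances. It assumes numeric columns and plain sample averages (biased estimates, not k-statistics). Ranking by absolute value in `largest` is also an assumption, and all function names are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import fmean


def _center(columns):
    """Subtract each column's mean; `columns` is a list of equal-length lists."""
    centered = []
    for col in columns:
        mu = fmean(col)
        centered.append([v - mu for v in col])
    return centered


def _moment(centered, idx):
    """Sample mean of the row-wise product of the centered columns in `idx`."""
    n = len(centered[0])
    total = 0.0
    for r in range(n):
        prod = 1.0
        for i in idx:
            prod *= centered[i][r]
        total += prod
    return total / n


def joint_cumulants(columns, order):
    """Map each sorted index tuple of length `order` to its joint cumulant."""
    if order not in (3, 4):
        raise ValueError("this sketch only handles order 3 or 4")
    c = _center(columns)
    out = {}
    for idx in combinations_with_replacement(range(len(c)), order):
        if order == 3:
            # Third-order joint cumulant = third central moment.
            out[idx] = _moment(c, idx)
        else:
            i, j, k, l = idx
            # Fourth central moment minus the three covariance pairings.
            out[idx] = (
                _moment(c, idx)
                - _moment(c, (i, j)) * _moment(c, (k, l))
                - _moment(c, (i, k)) * _moment(c, (j, l))
                - _moment(c, (i, l)) * _moment(c, (j, k))
            )
    return out


def largest(cumulants, k=100):
    """Return the k cumulant values with the largest magnitude (assumed ranking)."""
    return sorted(cumulants.values(), key=abs, reverse=True)[:k]


# Tiny check on a single column [0, 0, 3]: centered values are [-1, -1, 2].
k3 = joint_cumulants([[0.0, 0.0, 3.0]], 3)
k4 = joint_cumulants([[0.0, 0.0, 3.0]], 4)
```

Running the same functions on the real table and on each synthetic table, then comparing the outputs of `largest`, gives a comparison in the spirit of Figure 4. For tables with many columns the number of index tuples grows quickly, so a vectorized implementation would be preferable in practice.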