Generating Realistic Tabular Data with Large Language Models

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Svetha Venkatesh

TL;DR

This work proposes an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data, together with a feature-conditional sampling approach to generate synthetic samples.

Abstract

While most generative models show achievements in image data generation, few are developed for tabular data generation. Recently, due to the success of large language models (LLMs) in diverse tasks, they have also been used for tabular data generation. However, these methods do not capture the correct correlation between the features and the target variable, hindering their applications in downstream predictive tasks. To address this problem, we propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. First, we propose a novel permutation strategy for the input data in the fine-tuning phase. Second, we propose a feature-conditional sampling approach to generate synthetic samples. Finally, we generate the labels by constructing prompts based on the generated samples to query our fine-tuned LLM. Our extensive experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks. It also produces highly realistic synthetic samples in terms of quality and diversity. More importantly, classifiers trained with our synthetic data can even compete with classifiers trained with the original data on half of the benchmark datasets, which is a significant achievement in tabular data generation.

Paper Structure

This paper contains 27 sections, 3 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Training and evaluation of a tabular generation method. Training phase: the tabular generation method learns to generate a synthetic dataset ${\cal D}_{fake}$ to approximate the real dataset ${\cal D}_{real}$. Evaluation phase: ${\cal D}_{fake}$ is used to train a predictive model (e.g., XGBoost). The predictive model is evaluated on a held-out real test set ${\cal D}_{test}$ to compute accuracy or mean squared error (MSE). A better score implies a better tabular generation method.
  • Figure 2: The ground-truth feature-class correlation on the original dataset Cardiotocography (1st plot). Existing methods cannot capture this correlation on the synthetic data. Only our method Pred-LLM can capture it.
  • Figure 3: Our method Pred-LLM includes three phases: (1) fine-tuning a pre-trained LLM with the real dataset, (2) sampling synthetic samples conditioned on each feature, and (3) constructing prompts based on the generated data to query labels.
  • Figure 4: Attention visualization. Existing LLM-based methods permute both $X$ and $Y$. We call this strategy permute_xy. When the target variable "Income" is shuffled to the beginning of the sentence, there are no attention links between the features and the label "Low", as shown in (a). In contrast, we only permute $X$ while fixing $Y$ at the end of the sentence. We call our strategy permute_x. The attention mechanism of our LLM can learn some correlations between the label "Low" and other features "Age" and "Edu", as shown in (b).
  • Figure 5: The average quality and diversity scores over 20 datasets. $\downarrow$ means "lower is better". $\uparrow$ means "higher is better".
  • ...and 3 more figures
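
The difference between permute_xy and permute_x described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each row is serialized into a sentence of comma-separated "name is value" clauses, a format the captions above do not specify. The function name `encode_row`, the `strategy` argument, and the example values for "Age" and "Edu" are hypothetical and are not taken from the paper.

```python
import random


def encode_row(row, target, rng, strategy="permute_x"):
    """Serialize one table row into a sentence of "name is value" clauses.

    strategy="permute_x":  shuffle the features only and keep the target last.
    strategy="permute_xy": shuffle the features and the target together.
    The clause format is an assumption made for illustration.
    """
    features = [name for name in row if name != target]
    if strategy == "permute_x":
        rng.shuffle(features)
        order = features + [target]  # the target always closes the sentence
    elif strategy == "permute_xy":
        order = features + [target]
        rng.shuffle(order)  # the target may land anywhere
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ", ".join(f"{name} is {row[name]}" for name in order)


# Hypothetical row using the column names mentioned in the Figure 4 caption.
row = {"Age": 39, "Edu": "Bachelors", "Income": "Low"}
rng = random.Random(0)

sentences_x = [encode_row(row, "Income", rng, "permute_x") for _ in range(20)]
sentences_xy = [encode_row(row, "Income", rng, "permute_xy") for _ in range(20)]

for sentence in sentences_x[:3]:
    print("permute_x: ", sentence)
for sentence in sentences_xy[:3]:
    print("permute_xy:", sentence)
```

With permute_x, every sentence ends with the target clause, so a left-to-right language model always sees all features before it has to predict the label. With permute_xy, the label can appear before some or all of the features, which is the situation the caption associates with missing attention links.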