Language Models are Realistic Tabular Data Generators

Vadim Borisov; Kathrin Seßler; Tobias Leemann; Martin Pawelczyk; Gjergji Kasneci

TL;DR

GReaT generates realistic synthetic tabular data by encoding each row as text and fine-tuning a pretrained auto-regressive LLM on the result. Random feature order permutations during fine-tuning enable conditioning on arbitrary feature subsets, and regular expressions recover tabular samples from the generated text; the pipeline avoids heavy preprocessing. Across six real-world and three synthetic datasets, GReaT achieves state-of-the-art or competitive performance on multiple metrics, illustrating the value of large language models for heterogeneous tabular data. The approach is flexible, comes with an open-source implementation, and supports privacy-preserving data sharing and broader use in downstream analytics.
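The encode, permute, and extract steps can be illustrated with a short round trip. This is a minimal sketch, not the authors' implementation: the "name is value" sentence template, the helper names `encode_row` and `decode_sentence`, and the regular expression are all assumptions made for illustration, and the pattern would break on values that themselves contain commas.

```python
import random
import re

def encode_row(row, rng):
    """Encode one table row as a sentence, with features in random order."""
    items = list(row.items())
    rng.shuffle(items)  # random feature order permutation
    return ", ".join(f"{name} is {value}" for name, value in items)

# Hypothetical pattern: "<feature name> is <value>" clauses separated by commas.
CLAUSE = re.compile(r"\s*([^,]+?) is ([^,]+?)\s*(?:,|$)")

def decode_sentence(text, columns):
    """Recover a row dict from text; return None if any column is missing."""
    found = {name: value for name, value in CLAUSE.findall(text)}
    if any(col not in found for col in columns):
        return None  # reject incomplete samples
    return {col: found[col] for col in columns}

row = {"Age": 39, "Education": "Bachelors", "Income": "<=50K"}
sentence = encode_row(row, random.Random(0))
decoded = decode_sentence(sentence, list(row))  # values come back as strings
```

Because the decoder looks clauses up by feature name, it does not depend on the order in which the features were written, which is what makes the random permutation harmless at extraction time.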

Abstract

Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets with heterogeneous feature types coming in various sizes.
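Conditioning on a feature subset amounts to writing the known features as a text prefix and letting the model complete the rest. The sketch below assumes a "name is value" sentence template and uses a stand-in `fake_generate` function in place of a fine-tuned LLM; both, along with the helper names, are hypothetical and do not reflect the GReaT library's API.

```python
import random

def build_prompt(known, rng):
    """Turn known feature-value pairs into a text prefix for the model."""
    items = list(known.items())
    rng.shuffle(items)  # any order is acceptable if training used permutations
    return "".join(f"{name} is {value}, " for name, value in items)

def complete_row(known, generate, rng):
    """Concatenate the conditioning prefix with the model's continuation."""
    prompt = build_prompt(known, rng)
    return prompt + generate(prompt)

# Stand-in for a fine-tuned LLM (hypothetical): returns a fixed continuation.
def fake_generate(prompt):
    return "Income is <=50K"

text = complete_row(
    {"Age": 39, "Education": "Bachelors"}, fake_generate, random.Random(1)
)
```

In a real setting, `generate` would wrap a sampling call to the fine-tuned model, and the completed text would then be parsed back into a table row.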

Paper Structure (16 sections, 3 equations, 8 figures, 10 tables)

Introduction
Related Work
GReaT: Generation of Realistic Tabular Data
GReaT fine-tuning
GReaT sampling of synthetic data
Sampling and extraction of synthetic tabular data.
Experimental Evaluation
Conclusion
Discussion
Additional Experimental Results
Further MLE measures
Average Negative Log-Likelihood Metric for synthetic data
Distance to Closest Record Results
Additional Qualitative Analysis Results
Runtime Comparison
...and 1 more sections

Figures (8)

Figure 1: A comparison of the original and generated samples for the California Housing data set (Pace & Barry, 1997), which contains characteristic information about different properties in California, USA. We show joint histogram plots of the highly interconnected variables Latitude and Longitude. The black outline indicates the true boundary of the state of California.
Figure 2: The GReaT data pipeline for the fine-tuning step. First, a textual encoding step transforms tabular data into meaningful text (a). Subsequently, a feature order permutation step is applied (b), before the obtained sentences can be used for the fine-tuning of a large language model (LLM) (c). The toy tabular data set is inspired by the Adult Income data set (Dua & Graff, 2019).
Figure 3: The sampling procedure of the proposed method for synthetic data generation. In order to generate new data points using a pretrained LLM, it is necessary to transform either a single feature name or an arbitrary combination of feature-value pairs into text (a). Subsequently, the input is completed by the fine-tuned LLM (b) and can be transformed back into a tabular format (c). In comparison to other state-of-the-art approaches, GReaT allows arbitrary conditioning on feature subsets without model retraining, i.e., the sampling can be performed by conditioning on any feature name or combination of feature names and values.
Figure 4: Distance to closest record (DCR) distributions for the California Housing data set with respect to the original train set. "Original Test Data Set" shows the DCR between the original test set and the original train set. This experiment shows that the proposed method does not "copy" samples from the training set but rather generates new synthetic samples close to the original samples.
Figure 5: Distance to closest record (DCR) distributions for the HELOC data set with respect to the original train set. "Original Test Data Set" shows the DCR between the original test set and the original train set. According to this experiment, the proposed method does not "copy" samples from the training set but rather generates new synthetic samples close to the original samples.
...and 3 more figures
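The distance to closest record (DCR) referenced in Figures 4 and 5 can be sketched as a nearest-neighbour distance from each synthetic sample to the training set. The example below is an illustration under assumptions: it uses the L1 distance on purely numerical records, and the function name `dcr` and the toy data are hypothetical; the captions above do not specify the distance function used in the paper.

```python
def dcr(samples, train):
    """For each sample, the L1 distance to its nearest training record."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return [min(l1(s, t) for t in train) for s in samples]

# Toy numerical records (made up for illustration).
train = [(0.0, 0.0), (1.0, 1.0), (4.0, 2.0)]
synthetic = [(1.0, 1.0), (3.0, 2.5)]
distances = dcr(synthetic, train)
```

A DCR of zero means the sample is an exact copy of a training record, so a distribution concentrated at zero would suggest memorisation, while small positive values indicate new samples that lie close to the original data.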

Theorems & Definitions (2)

Definition 1: Textual encoder
Definition 2: Feature order permutation function
