Why LLMs Are Bad at Synthetic Table Generation (and what to do about it)

Shengzhe Xu; Cho-Ting Lee; Mandar Sharma; Raquib Bin Yousuf; Nikhil Muralidhar; Naren Ramakrishnan

TL;DR

This work addresses a weakness of autoregressive LLMs: they struggle to generate faithful synthetic tabular data because they ignore functional dependencies (FDs) between columns. It introduces Permutation-Aided Fine-Tuning (PAFT), which learns an FD-based dependency graph, distills complex FDs into actionable edges, and optimizes a permutation of columns that governs the generation order, thereby better approximating the joint distribution $P(\mathcal{A})$. On six real datasets, evaluated on conditional distributions, domain consistency, data-sniff tests, and downstream ML replacement tasks, PAFT achieves higher fidelity and realism than strong baselines, and the results expose the limitations of relying solely on univariate or simple correlation metrics. In practice, the approach yields more reliable synthetic data for ML pipelines, and the results indicate that even with newer LLMs, targeted calibration via FD-aware permutation remains essential for high-quality synthetic tabular data.
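The ordering step summarized above can be illustrated with a minimal sketch. It assumes the FDs have already been distilled into single-column edges of the form (determinant, dependent) and that the resulting graph is acyclic; the function name `column_order` and the toy columns are hypothetical, and a plain topological sort stands in here for whatever optimization the paper actually uses.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def column_order(columns, fds):
    """Return a column order in which every determinant precedes its dependent.

    columns: list of column names.
    fds: list of (determinant, dependent) pairs; assumed already distilled
         to single-column edges (an assumption of this sketch).
    """
    # TopologicalSorter expects a mapping: node -> set of predecessors.
    predecessors = {col: set() for col in columns}
    for determinant, dependent in fds:
        predecessors[dependent].add(determinant)
    # static_order() raises graphlib.CycleError if the FD graph has a cycle.
    return list(TopologicalSorter(predecessors).static_order())


# Toy schema, invented for illustration only.
columns = ["state", "zip_code", "city"]
fds = [("zip_code", "city"), ("city", "state")]
order = column_order(columns, fds)
print(order)
```

With this order, a row would be serialized for fine-tuning with `zip_code` first and `state` last, so each dependent column is generated only after the column that determines it.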

Abstract

Synthetic data generation is integral to ML pipelines, e.g., to augment training data, replace sensitive information, and even to power advanced platforms like DeepSeek. While LLMs fine-tuned for synthetic data generation are gaining traction, the generation of synthetic tables, a critical data type in business and science, remains under-explored compared to text and image synthesis. This paper shows that LLMs, whether used as-is or after traditional fine-tuning, are inadequate for generating synthetic tables. Their autoregressive nature, combined with random order permutation during fine-tuning, hampers the modeling of functional dependencies and prevents them from capturing the conditional mixtures of distributions essential for real-world constraints. We demonstrate that making LLMs permutation-aware can mitigate these issues.
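The role of column order can be written out as a short sketch. The notation is assumed here rather than taken from the paper: $a_1, \dots, a_m$ are the columns of a row and $\pi$ is a permutation of $\{1, \dots, m\}$. An autoregressive model that emits columns in the order $\pi$ factorizes the joint distribution by the chain rule:

$$P(\mathcal{A}) = \prod_{i=1}^{m} P\big(a_{\pi(i)} \mid a_{\pi(1)}, \dots, a_{\pi(i-1)}\big)$$

Every $\pi$ gives a mathematically valid factorization, but the factors differ in how hard they are to learn. If an FD $a_j \to a_k$ holds and $\pi$ places $a_j$ before $a_k$, the factor for $a_k$ puts all its mass on a single value once $a_j$ is known. If $a_k$ comes first, the model must instead learn the distribution of $a_j$ given $a_k$, which is generally a mixture over many determinant values. Under random permutations during fine-tuning, the model sees both kinds of ordering and is never consistently trained on the deterministic one.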

Paper Structure (16 sections, 4 equations, 4 figures, 6 tables)

Introduction
Related Work
Challenges to Synthetic Table Generation in the Current LLM Paradigm
PAFT: Permutation-Aided Fine-Tuning
Tabular Data Generation with LLMs
Discovery and Distillation of Functional Dependencies (FD)
Putting It All Together
Synthetic Data Generation using PAFT
Experimental Evaluation
RQ1: Does PAFT-generated synthetic data accurately capture conditional distributions within categories?
RQ2: Does PAFT generate data respecting the consistency of intrinsic data characteristics?
RQ3: Does the synthetic data generated by PAFT pass the sniff test?
RQ4: Can data generated by PAFT replace real data in downstream ML model training?
RQ5: Do the data sets generated by PAFT adhere to real distributions and possess mode diversity?
RQ6: Do newer generations of LLMs obviate the need for PAFT?
...and 1 more section

Figures (4)

Figure 1: Overview of the proposed Permutation-Aided Fine-Tuning (PAFT) approach.
Figure 2: FD-Distillation and Dependency Graph Sorting for automatically extracting order permutations from tables.
Figure 3: For a composite dataset, US-locations, this comparison examines state-specific violation rates across different synthetic data generation approaches. The error bars represent standard deviation. The states on the x-axis are ordered by decreasing violation rates. PAFT significantly reduces state-specific violations in the composite dataset.
Figure 4: Column distribution visualizations for each dataset, as generated by CTGAN, CopulaGAN, GReaT, and PAFT. The top row displays examples of numerical columns, while the bottom row presents examples of categorical columns. Overall, PAFT (blue) has the closest distribution to real data (red) compared to the other synthesis methods. PAFT also demonstrates the ability to generate diverse values.

Theorems & Definitions (1)

Definition 1: Schema-Level FD
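
The text of Definition 1 is not reproduced in this summary, so the following is only a sketch of the standard database notion of a functional dependency: X → Y holds in a table if any two rows that agree on X also agree on Y. The helper `fd_violations`, the toy rows, and the violation-rate calculation are all hypothetical; the paper's own definition and metric may differ.

```python
def fd_violations(rows, determinant, dependent):
    """Return rows that contradict the FD determinant -> dependent.

    A row counts as a violation if an earlier row with the same determinant
    value carried a different dependent value (first value seen wins; this
    tie-breaking rule is an assumption of the sketch).
    """
    first_seen = {}
    violations = []
    for row in rows:
        key, value = row[determinant], row[dependent]
        if key not in first_seen:
            first_seen[key] = value
        elif first_seen[key] != value:
            violations.append(row)
    return violations


# Toy rows, invented for illustration only.
rows = [
    {"zip_code": "24060", "state": "VA"},
    {"zip_code": "24060", "state": "VA"},
    {"zip_code": "24060", "state": "TX"},  # contradicts zip_code -> state
    {"zip_code": "10001", "state": "NY"},
]
bad = fd_violations(rows, "zip_code", "state")
violation_rate = len(bad) / len(rows)
print(bad, violation_rate)
```

A check of this kind, run on generated rows, is one simple way to quantify how often a synthetic table breaks a dependency that holds in the real data.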
