ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning

Xiaofeng Lin, Seungbae Kim, Zhuoya Li, Zachary DeSoto, Charles Fleming, Guang Cheng

TL;DR

This work fine-tunes a language model-based tabular generator with ReTabSyn, a reinforcement learning pipeline. Across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines.

Abstract

Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in low-data, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving for the full joint distribution could be overkill; for greater data efficiency, models should prioritize learning the conditional distribution $P(y\mid \bm{X})$, as suggested by recent theoretical analysis. Therefore, we overcome this limitation with ReTabSyn, a Reinforced Tabular Synthesis pipeline that provides direct feedback on feature correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data is limited, thereby strengthening downstream model utility. We empirically fine-tune a language model-based generator using this approach, and across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines. Moreover, our approach can be readily extended to control various aspects of synthetic tabular data, such as applying expert-specified constraints on generated observations.

Paper Structure (40 sections, 1 theorem, 5 equations, 5 figures, 12 tables)

This paper contains 40 sections, 1 theorem, 5 equations, 5 figures, 12 tables.

Key Result

Theorem 1

Assuming the loss $\ell$ is Lipschitz and bounded, the utility gap is bounded by two distinct error terms. Here $d_{\mathcal{F}}$ is an integral probability metric measuring the distance between feature marginals, and the constants $C_1, C_2$ depend on the hypothesis class $\mathcal{F}$.

Figures (5)

  • Figure 1: Illustration of the challenging scenario for tabular generator: in scenarios with limited or imbalanced training data, tabular generators may produce synthetic datasets containing unrealistic entries, potentially degrading the performance of downstream machine learning tasks.
  • Figure 2: Overall workflow of ReTabSyn. Starting from scarce, imbalanced real data, we first train a base tabular generator $\pi_{\mathrm{ref}}$ via supervised learning. Guided by the utility decomposition (Thm. 3.1), we construct oracle-free preference-labeled tuples by label/target perturbation: for each row, we keep the conditioning context (prompt) fixed and create a chosen tuple with the original label and a rejected tuple with a perturbed label, forming prompt--chosen--rejected training pairs. Finally, we fine-tune the generator with DPO, using $\pi_{\mathrm{ref}}$ as the reference policy, to enlarge the likelihood margin between chosen and rejected tuples, improving feature--target alignment and downstream ML utility.
  • Figure 3: AUROC scores of synthetic data-trained models on in-distribution test sets, across varying training set sizes. The left panel shows performance on pure synthetic data, and the right panel on real data augmented with synthetic data.
  • Figure 4: Correlation similarity matrix visualized using a color scale. Shades closer to red indicate high similarity, meaning the correlation between real and synthetic data is well preserved. In contrast, shades of blue signify low similarity, suggesting a weaker alignment between real and synthetic feature correlation.
  • Figure 5: Effect of varying the proportion $\rho$ of DPO steps relative to total fine-tuning steps. As $\rho$ increases, fidelity remains stable while utility (ROC AUC) consistently improves.
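
The preference-pair construction and DPO objective described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the "name is value" text serialization, the helper names `serialize`, `make_preference_pair`, and `dpo_loss`, the example row, and the default `beta=0.1` are all assumptions introduced here. The loss is the standard per-pair DPO loss, computed from sequence log-likelihoods under the fine-tuned policy and the reference policy $\pi_{\mathrm{ref}}$.

```python
import math
import random


def serialize(row, columns):
    """Render the given columns as 'name is value' text (assumed format)."""
    return ", ".join(f"{c} is {row[c]}" for c in columns)


def make_preference_pair(row, feature_cols, target_col, target_values, rng):
    """Build one prompt/chosen/rejected tuple by perturbing only the target.

    The conditioning context (features) is kept fixed; the chosen completion
    carries the observed label and the rejected one a different label.
    """
    alternatives = [v for v in target_values if v != row[target_col]]
    rejected_label = rng.choice(alternatives)
    return {
        "prompt": serialize(row, feature_cols),
        "chosen": f"{target_col} is {row[target_col]}",
        "rejected": f"{target_col} is {rejected_label}",
    }


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss for a single pair: -log sigmoid(beta * margin).

    The margin is the policy-vs-reference log-ratio of the chosen completion
    minus that of the rejected completion.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z) == -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    rng = random.Random(0)
    row = {"age": 42, "hours": 50, "income": ">50K"}  # hypothetical record
    pair = make_preference_pair(
        row, ["age", "hours"], "income", ["<=50K", ">50K"], rng
    )
    print(pair)
    # Loss shrinks as the policy favors the chosen label more than the reference does.
    print(dpo_loss(-0.5, -3.0, -1.0, -1.0))
```

Because only the target is perturbed, each pair isolates the feature-target relationship, which is the part of the distribution the caption says the fine-tuning stage is meant to sharpen. With a continuous target, the rejected value would need a different perturbation rule, which this sketch does not cover.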

Theorems & Definitions (1)

  • Theorem 1: Utility Decomposition (xu2023utility)