REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
Aivin V. Solatorio, Olivier Dupriez
TL;DR
This paper introduces REaLTabFormer, a transformer-based framework for synthesizing realistic non-relational and relational tabular data. It models the parent table with an autoregressive GPT-2 model and generates child tables with a Seq2Seq model conditioned on the parent. To limit data copying while preserving realism, it adds privacy- and quality-focused mechanisms: target masking, overfitting detection based on the distance to closest record (DCR), and a bootstrap-derived $Q_δ$ statistic used for early stopping. The authors report strong predictive utility on large non-relational datasets without task-specific tuning, and better capture of relational structure than baselines on the Rossmann and Airbnb datasets. They also release an open-source Python package to support adoption and further research in synthetic tabular data generation and privacy-aware sampling.
Abstract
Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is challenging as it requires modeling both a "parent" table and its relationships across tables. We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a tabular and relational synthetic data generation model. It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence (Seq2Seq) model. We implement target masking to prevent data copying and propose the $Q_δ$ statistic and statistical bootstrapping to detect overfitting. Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model. REaLTabFormer also achieves state-of-the-art results on prediction tasks, "out-of-the-box", for large non-relational datasets without needing fine-tuning.
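The distance to closest record (DCR) mentioned in the summary can be illustrated with a minimal sketch. It assumes purely numeric features and Euclidean distance; the function name `distance_to_closest_record` and the toy arrays are hypothetical, and the paper's own distance measure and the $Q_δ$ statistic built on top of it are not reproduced here.

```python
import numpy as np


def distance_to_closest_record(synthetic, reference):
    """For each synthetic row, return the distance to its nearest reference row.

    Assumption: all features are numeric and Euclidean distance is used.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise differences via broadcasting: (n_synthetic, n_reference, n_features)
    diffs = synthetic[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep only the distance to the closest reference record for each row
    return dists.min(axis=1)


# Toy data (hypothetical): three training rows and three synthetic rows
train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
synthetic = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 3.0]])

dcr = distance_to_closest_record(synthetic, train)
print(dcr)  # the first value is 0.0 because that row copies a training record
```

A DCR of zero means the synthetic row exactly duplicates a training record. A DCR distribution concentrated near zero is the kind of signal an overfitting check would look for, although the threshold and the comparison against held-out data are choices this sketch leaves open.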
