REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers

Aivin V. Solatorio, Olivier Dupriez

TL;DR

This paper introduces REaLTabFormer, a transformer-based framework for synthesizing realistic non-relational and relational tabular data. It combines an autoregressive GPT-2 model for the parent table with a Seq2Seq model that generates the child table conditioned on the parent. It adds privacy- and quality-focused mechanisms to mitigate data copying while preserving realism: target masking, overfitting detection based on the distance to closest record (DCR), and a bootstrap-derived $Q_δ$ statistic for early stopping. The authors demonstrate strong predictive utility on large non-relational datasets without task-specific tuning, and show that the model captures relational structure better than baselines on the Rossmann and Airbnb datasets. They also release an open-source Python package to support adoption and further research in synthetic tabular data generation and privacy-aware sampling.

Abstract

Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is challenging as it requires modeling both a "parent" table and its relationships across tables. We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a tabular and relational synthetic data generation model. It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence (Seq2Seq) model. We implement target masking to prevent data copying and propose the $Q_δ$ statistic and statistical bootstrapping to detect overfitting. Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model. REaLTabFormer also achieves state-of-the-art results on prediction tasks, "out-of-the-box", for large non-relational datasets without needing fine-tuning.
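The distance to closest record (DCR) mentioned in the summary can be illustrated with a minimal sketch. This is not the paper's $Q_δ$ statistic or its bootstrap procedure, which are not spelled out here. The function name `distance_to_closest_record`, the choice of Euclidean distance on numeric rows, and the toy data are all assumptions made for illustration.

```python
import math


def distance_to_closest_record(queries, reference):
    """For each row in `queries`, return the distance to its nearest row in `reference`.

    Assumption: rows are equal-length numeric sequences and Euclidean
    distance is an acceptable metric. Real tabular data would first need
    encoding and scaling of categorical and numeric columns.
    """
    return [min(math.dist(q, r) for r in reference) for q in queries]


# Toy, made-up data: three "training" rows and two "synthetic" rows.
train = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
synthetic = [(0.0, 0.0), (1.0, 2.0)]

dcr = distance_to_closest_record(synthetic, train)
# A DCR of exactly 0 means the synthetic row is an exact copy of a training row.
copied = [d == 0.0 for d in dcr]
```

In this toy case the first synthetic row duplicates a training row, so its DCR is zero, while the second lies one unit away from its nearest neighbour. Many near-zero distances in a synthetic sample would be one signal that a generator is memorizing its training data.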

Paper Structure (35 sections, 5 equations, 4 figures, 2 tables)

This paper contains 35 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model, which uses GPT-2 with a causal LM head. The right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The GPT-2 model trained on the parent table, with its weights frozen, is also used as the encoder in the Seq2Seq model.
  • Figure 2: Graph of the daily mean of the Sales variable computed from the original Rossmann dataset (blue), synthetic data produced by REaLTabFormer (orange), and data generated by SDV (green). REaLTabFormer captures the seasonality in the data more closely than the HMA model from the SDV.
  • Figure 3: Joint distributions of the age_group variable in the parent table and the device_type variable in the child table for the Airbnb test dataset (left), SDV (middle), and REaLTabFormer (right). The plots show that REaLTabFormer can synthesize values across the domain of the variables, while SDV learned only two device types out of thirteen. REaLTabFormer also generalized and imputed age values for users with the "iPodtouch" device (red box). This device type group has missing values for age in the original data.
  • Figure 4: Summary of the average "Sales" variable in the child table of the Rossmann dataset, grouped by the "StoreType" variable in the parent table. The values shown are from the original data (blue), synthetic data produced by REaLTabFormer (orange), and data generated by SDV (green). This graph shows that REaLTabFormer captures the inter-table variations and relationships well.
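
The cross-table summary described for Figure 4 (a child-table value averaged within groups defined by a parent-table column) can be sketched as follows. This is an illustrative sketch in plain Python, not the authors' evaluation code. The helper name `mean_child_value_by_parent_group`, the join key `Store`, and the toy rows are assumptions; only the `Sales` and `StoreType` column names come from the caption.

```python
from collections import defaultdict
from statistics import mean


def mean_child_value_by_parent_group(parent_rows, child_rows, key, group_col, value_col):
    """Average a child-table column within groups defined by a parent-table column."""
    # Map each parent key to its group label (e.g. store id -> StoreType).
    group_of = {row[key]: row[group_col] for row in parent_rows}
    values = defaultdict(list)
    for row in child_rows:
        # Look up the parent's group through the shared key and collect the value.
        values[group_of[row[key]]].append(row[value_col])
    return {group: mean(vals) for group, vals in values.items()}


# Toy, made-up rows; "Store" as the join key is an assumption.
parent = [
    {"Store": 1, "StoreType": "a"},
    {"Store": 2, "StoreType": "b"},
    {"Store": 3, "StoreType": "a"},
]
child = [
    {"Store": 1, "Sales": 100.0},
    {"Store": 1, "Sales": 200.0},
    {"Store": 2, "Sales": 50.0},
    {"Store": 3, "Sales": 300.0},
]

summary = mean_child_value_by_parent_group(parent, child, "Store", "StoreType", "Sales")
```

Running the same summary on the original and on each synthetic dataset, then comparing the per-group averages, gives the kind of side-by-side view the figure reports.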