Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild

G. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona, Tanguy Urvoy

TL;DR

This paper addresses synthetic tabular data detection under cross-table shift with a datum-wise transformer: each datum is encoded as a string into a 192-dimensional embedding, and a row transformer pools these embeddings into a CLS-Target token used for binary classification. The model is table-agnostic and permutation-invariant, and a domain-adaptation head with gradient reversal reduces its dependence on table structure. On real and synthetic tables, the datum-wise approach outperforms the Flat Text, TaBERT-embd, and BART-embd baselines, and domain adaptation brings further gains, particularly under cross-table shift. The authors present the work as a robust, scalable step toward tabular foundation models and practical synthetic data detection in real-world deployments.
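The gradient-reversal idea mentioned above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the class name `GradientReversal`, the scaling factor `lam`, and the toy values are assumptions made for illustration only. The layer acts as the identity in the forward pass and flips the sign of the gradient flowing back from the domain head, so the encoder is pushed away from features that identify the source table.

```python
class GradientReversal:
    """Hypothetical gradient-reversal layer (illustrative, not the paper's code)."""

    def __init__(self, lam=1.0):
        # lam scales the reversed gradient; its value here is an assumption.
        self.lam = lam

    def forward(self, features):
        # Forward pass: features go to the domain head unchanged.
        return list(features)

    def backward(self, grad_output):
        # Backward pass: negate (and scale) the gradient sent to the encoder.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]             # toy row embedding
out = grl.forward(features)             # identical to the input
grad_from_domain_head = [1.0, -2.0, 0.5]
grad_to_encoder = grl.backward(grad_from_domain_head)
```

In an automatic-differentiation framework the same behavior would typically be written as a custom function with an overridden backward pass; the manual version here only shows the sign flip.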

Abstract

The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as images or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of columns may vary widely from one table to another. We tackle the hard problem of detecting synthetic tabular data "in the wild", i.e., when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.

Paper Structure

This paper contains 17 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example of mixed table classification instances with their labels and domains.
  • Figure 2: Datum-wise transformer pipeline with domain adaptation head.
  • Figure 3: Cross-table shift protocol: the real-vs-synthetic detector is trained on a mixture of table rows and tested/deployed on a mixture from holdout tables.
  • Figure 4: We present the average and standard deviation of AUC performance during the first 10 epochs of training and validation under cross-table shift.
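
The cross-table shift protocol described in the Figure 3 caption can be sketched as a split by table rather than by row. This is a hedged illustration, not the authors' code: the function name `cross_table_split`, the `holdout_fraction` and `seed` parameters, the row fields, and the table names are all hypothetical. The point it shows is that every row of a holdout table goes to the test side, so the detector is evaluated only on table structures it never saw during training.

```python
import random


def cross_table_split(rows, holdout_fraction=0.25, seed=0):
    """Split rows so that train and test share no source table (illustrative)."""
    tables = sorted({row["table"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(tables)
    # Keep at least one table on the holdout side.
    n_holdout = max(1, round(len(tables) * holdout_fraction))
    holdout = set(tables[:n_holdout])
    train = [row for row in rows if row["table"] not in holdout]
    test = [row for row in rows if row["table"] in holdout]
    return train, test


# Toy mixture: each row carries its domain (source table) and a real/synthetic label.
rows = [
    {"table": name, "label": label, "row_id": i}
    for name in ["table_a", "table_b", "table_c", "table_d"]  # hypothetical names
    for label in ["real", "synthetic"]
    for i in range(3)
]
train_rows, test_rows = cross_table_split(rows)
train_tables = {row["table"] for row in train_rows}
test_tables = {row["table"] for row in test_rows}
```

A plain random row split would leak rows from the same table into both sides, which measures in-table generalization instead of the "in the wild" setting.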