Robust Detection of Synthetic Tabular Data under Schema Variability

G. Charbel N. Kindji, Elisa Fromont, Lina Maria Rojas-Barahona, Tanguy Urvoy

TL;DR

This work addresses the detection of synthetic tabular data under real-world schema variability. It introduces a table-agnostic, column-permutation-invariant, datum-wise Transformer with a gradient-reversal-based table adaptation mechanism. The model processes a textual representation of each datum and aggregates these into row-level predictions, reaching state-of-the-art performance under cross-table shift (AUC ~0.69, accuracy ~0.66), with table adaptation improving robustness. Cross-domain transfer remains challenging and calls for additional domain-generalization strategies, but the results provide strong evidence that detecting synthetic tabular data is feasible in real-world conditions. The study also points to extending the architecture to broader tabular tasks and motivates future work on pretraining, few-shot learning, and more comprehensive domain coverage.
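To make "column-permutation-invariant" and "datum-wise" concrete, here is a minimal sketch, not the authors' model. It assumes each cell is serialized as the text "column: value", replaces the learned Transformer encoder with a hypothetical hash-based embedding (`embed_datum`), and uses mean pooling as the row-level aggregation. The names `DIM`, `embed_datum`, and `embed_row` are illustrative only.

```python
import hashlib
import math

DIM = 8  # toy embedding size (assumption, not from the paper)


def embed_datum(column: str, value) -> list[float]:
    """Hypothetical stand-in for a learned datum encoder.

    Serializes one cell as 'column: value' and hashes the text
    into DIM floats in [-1, 1].
    """
    text = f"{column}: {value}"
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(DIM)]


def embed_row(row: dict) -> list[float]:
    """Mean-pool the datum embeddings of a row.

    math.fsum returns a correctly rounded sum, so the result does not
    depend on the order in which the columns are visited.
    """
    vectors = [embed_datum(col, val) for col, val in row.items()]
    return [math.fsum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


row = {"age": 42, "city": "Rennes", "income": 31500.0}
reordered = dict(reversed(list(row.items())))  # same cells, reversed column order

# Same row representation regardless of column order.
assert embed_row(row) == embed_row(reordered)

# Rows from a table with a different schema map to the same vector size.
other = {"product": "laptop", "price": 999.0}
assert len(embed_row(other)) == DIM
```

Because the row vector depends only on the set of (column, value) pairs, the same function applies to tables with different numbers or orderings of columns, which is the property a table-agnostic detector needs.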

Abstract

The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data "in the wild", i.e., when the detector is deployed on tables with variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is feasible, and demonstrates substantial improvements over previous approaches. Following acceptance of the paper, we are finalizing the administrative and licensing procedures necessary for releasing the source code. This extended version will be updated as soon as the release is complete.

Paper Structure

This paper contains 31 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Cross-table shift protocol: the real-vs-synthetic detector is trained on a mixture of table rows and tested/deployed on a mixture from holdout tables.
  • Figure 2: Datum-wise transformer pipeline with table adaptation head.
  • Figure 3: t-SNE projection of row embeddings colored by table. The embeddings are extracted from the trained datum-wise model before and after applying the table adaptation strategy, with table names treated as domains.
  • Figure 4: t-SNE projection of row embeddings colored by table for the fine-tuned BART baseline on the first fold of the cross-table shift setting.
  • Figure 5: t-SNE projection of row embeddings colored by table for the datum-wise model in the Social domain under the cross-domain table shift setting.
  • ...and 4 more figures