Cross-table Synthetic Tabular Data Detection

G. Charbel N. Kindji, Lina Maria Rojas-Barahona, Elisa Fromont, Tanguy Urvoy

TL;DR

The paper tackles the problem of detecting synthetic tabular data across unseen generators and table formats, a setting it calls cross-table shift. It proposes three table-agnostic detectors (two text-based and one table-based) and evaluates them under four "wildness" scenarios using 14 real datasets and four synthetic data generators (TabDDPM, TabSyn, TVAE, CTGAN). Detection is strong when there is no distribution shift, especially for the table-based transformer, but cross-table shift severely degrades performance, revealing gaps in generalization. The work highlights the practical need for robust cross-table detection and points to future directions such as incorporating table metadata and leveraging pretrained encoders like TaBERT to improve generalization in real-world settings.

Abstract

Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified "in the wild", meaning across different generators, domains, and table formats. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose three cross-table baseline detectors and four distinct evaluation protocols, each corresponding to a different level of "wildness". Our very preliminary results confirm that cross-table adaptation is a challenging task.

Paper Structure

This paper contains 19 sections, 1 equation, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Training and validation AUC of models trained under the cross-table shift setup. Left: text-based model; right: table-based model.