Preserving logical and functional dependencies in synthetic tabular data
Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer
TL;DR
This work asks whether current synthetic tabular data generators preserve inter-attribute dependencies. It introduces a Q-function $Q_T(\mathcal{A},\mathcal{B})$ to quantify logical dependencies and uses FDTool to identify functional dependencies. A benchmark of seven generators (CTGAN, CTABGAN, CTABGAN Plus, TVAE, NextConvGeN, TabDDPM, TabuLa) on five public datasets shows that functional dependencies are poorly preserved: no model adequately carries all of them over from real to synthetic data. Logical dependencies fare better. The convex-space (NextConvGeN), diffusion-based (TabDDPM), and transformer-based (TabuLa) models retain them best, with TabuLa performing consistently across datasets. The findings point to a need for specialized FD-preserving generators that maintain semantic consistency in downstream analyses, particularly in clinical contexts where dependencies drive diagnostics and treatment decisions.
Abstract
Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In this article, we introduce the notion of logical dependencies among attributes, complementing the existing notion of functional dependencies, and we provide a measure to quantify logical dependencies in tabular data. Using this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. We also show that some synthetic tabular data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state of the art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models.
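To make the notion of a functional dependency concrete, the following minimal Python sketch checks whether a dependency A -> B (every value of A co-occurs with exactly one value of B) holds in a table, and whether it survives in a synthetic copy. It uses the textbook definition only; it is not FDTool and not the Q-function measure. The helper `fd_holds`, the column names, and the toy rows are hypothetical and chosen purely for illustration.

```python
from typing import Any, Dict, Hashable, List, Sequence, Tuple


def fd_holds(rows: List[Dict[str, Any]],
             lhs: Sequence[str],
             rhs: Sequence[str]) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in rows.

    The dependency holds when every combination of lhs values is paired
    with exactly one combination of rhs values. (Hypothetical helper,
    not part of any library.)
    """
    seen: Dict[Tuple[Hashable, ...], Tuple[Hashable, ...]] = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # setdefault stores the first rhs value seen for this lhs key;
        # any later mismatch violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Toy "real" table: zip -> city holds.
real = [
    {"zip": "18057", "city": "Rostock"},
    {"zip": "18057", "city": "Rostock"},
    {"zip": "10115", "city": "Berlin"},
]

# Toy "synthetic" table: the same zip now maps to two cities.
synthetic = [
    {"zip": "18057", "city": "Rostock"},
    {"zip": "18057", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
]

real_ok = fd_holds(real, ["zip"], ["city"])
synthetic_ok = fd_holds(synthetic, ["zip"], ["city"])
print(real_ok, synthetic_ok)
```

In this toy case the dependency holds in the real rows but is broken in the synthetic rows, which is the kind of loss the comparison in this article is concerned with.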
