Preserving logical and functional dependencies in synthetic tabular data

Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saparshi Bej, Olaf Wolkenhauer

TL;DR

This work addresses whether current synthetic tabular data generators preserve inter-attribute dependencies, introducing a Q-function $Q_T(\mathcal{A},\mathcal{B})$ to quantify logical dependencies and using FDTool to identify functional dependencies. It benchmarks seven generators (CTGAN, CTABGAN, CTABGAN Plus, TVAE, NextConvGeN, TabDDPM, TabuLa) on five public datasets, revealing that functional dependencies are poorly preserved across models, while several convex-space, diffusion-based, and transformer-based methods better retain logical dependencies. Notably, NextConvGeN, TabDDPM, and TabuLa show the strongest performance for preserving logical dependencies, with TabuLa offering broad consistency across datasets; none, however, adequately preserves all functional dependencies from real to synthetic data. The findings underscore a need for specialized FD-preserving synthetic data methods to maintain semantic consistency in downstream analyses, particularly in clinical contexts where dependencies drive diagnostics and treatment decisions.

Abstract

Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among attributes in this article. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Using this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. We also show that some synthetic tabular data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state of the art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models.
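
To make the notion of preserving a functional dependency concrete, the following is a minimal sketch, not the authors' pipeline and not FDTool. It assumes the textbook definition (attribute `a` functionally determines `b` if every value of `a` co-occurs with exactly one value of `b`), checks only single-attribute pairs by brute force, and uses small hypothetical tables. The helper names `holds_fd` and `pairwise_fds` are illustrative.

```python
from itertools import permutations


def holds_fd(rows, a, b):
    """Return True if attribute `a` functionally determines `b` in `rows`.

    `rows` is a list of dicts; the dependency holds when each value of `a`
    is paired with exactly one value of `b`.
    """
    seen = {}
    for row in rows:
        key, val = row[a], row[b]
        if key in seen and seen[key] != val:
            return False  # same `a` value, two different `b` values
        seen.setdefault(key, val)
    return True


def pairwise_fds(rows, attributes):
    """Collect all ordered pairs (a, b) for which a -> b holds."""
    return {(a, b) for a, b in permutations(attributes, 2) if holds_fd(rows, a, b)}


# Hypothetical toy tables, for illustration only.
real = [
    {"city": "Berlin", "country": "DE", "price": 80},
    {"city": "Berlin", "country": "DE", "price": 95},
    {"city": "Paris", "country": "FR", "price": 120},
]
synthetic = [
    {"city": "Berlin", "country": "DE", "price": 80},
    {"city": "Berlin", "country": "FR", "price": 95},  # breaks city -> country
    {"city": "Paris", "country": "FR", "price": 120},
]

attrs = ["city", "country", "price"]
real_fds = pairwise_fds(real, attrs)
synthetic_fds = pairwise_fds(synthetic, attrs)

preserved = real_fds & synthetic_fds  # dependencies present in both tables
lost = real_fds - synthetic_fds       # dependencies the synthetic table broke

print("preserved:", sorted(preserved))
print("lost:", sorted(lost))
```

In this toy case a single inconsistent synthetic row is enough to lose `city -> country`. The dependencies that survive start from `price`, which is unique in every row and therefore determines every other attribute trivially.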

Paper Structure (9 sections, 6 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Workflow of comparative analysis to assess preservation of functional and logical dependencies in synthetic tabular data using FDTool and Q-function algorithms.
  • Figure 2: Five histograms, one for each dataset used in this study. The $x$-axis shows the $Q$-scores in discrete bins, and the $y$-axis shows the number of attribute pairs whose $Q$-score falls in each bin. For a pair of attributes, a $Q$-score strictly between $0$ and $1$ implies a logical dependency, a $Q$-score of exactly $0$ indicates a functional dependency, and a $Q$-score of $1$ signifies no dependency. The Migraine and Airbnb datasets have both logical and functional dependencies, while the other datasets have only logical dependencies between attributes in the real data.
  • Figure 3: The chart illustrates the preservation of logical dependencies across different generative models, with the x-axis representing the models and the y-axis indicating the percentage of preserved logical dependencies. Each color corresponds to a different dataset. NextConvGeN, TabDDPM, and TabuLa models consistently exhibit higher percentages for all datasets, demonstrating their ability to retain more logical dependencies compared to other models.
  • Figure 4: Comparison of functional dependencies in Airbnb data: The figure displays Venn diagrams comparing functional dependencies in real (coral) and synthetic (green) Airbnb data from seven generative models. Numbers within circles indicate total counts of dependencies. Overlap shows shared dependencies retained by synthetic data. Notably, none of the generative models manage to preserve a larger number of dependencies than the real data. However, TabDDPM and TabuLa succeed in preserving some functional dependencies.
  • Figure 5: Comparison of functional dependencies in Migraine data: The figure displays Venn diagrams comparing functional dependencies in real (coral) and synthetic (green) Migraine data from various generative models. Numbers within circles indicate total counts of dependencies. Overlap shows shared dependencies retained by synthetic data. Notably, none of the generative models manage to preserve a larger number of dependencies than the real data. However, NextConvGeN and TabDDPM succeed in preserving some functional dependencies.
  • ...and 5 more figures
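
The reading rule in the Figure 2 caption can be written as a small lookup. This is a sketch under stated assumptions: the $Q$-scores are taken as given, because this excerpt does not define how the Q-function is computed. The pair names and values are hypothetical, and `classify_q` is an illustrative helper rather than part of any published tool.

```python
from collections import Counter


def classify_q(q):
    """Map a Q-score to the dependency type described in the Figure 2 caption."""
    if not 0.0 <= q <= 1.0:
        raise ValueError(f"Q-score must lie in [0, 1], got {q}")
    if q == 0.0:
        return "functional"  # exactly 0: functional dependency
    if q == 1.0:
        return "none"        # exactly 1: no dependency
    return "logical"         # strictly between 0 and 1: logical dependency


# Hypothetical Q-scores for attribute pairs (illustrative values only).
q_scores = {
    ("A", "B"): 0.0,
    ("A", "C"): 0.35,
    ("B", "C"): 1.0,
    ("C", "D"): 0.8,
}

labels = {pair: classify_q(q) for pair, q in q_scores.items()}
counts = Counter(labels.values())

for pair, label in labels.items():
    print(pair, label)
print(dict(counts))
```

The comparisons against `0.0` and `1.0` are exact here. Scores produced by floating-point arithmetic may need a small tolerance at both endpoints.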