Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation

Yunbo Long, Liming Xu, Alexandra Brintrup

TL;DR

The study addresses a common failure of synthetic tabular data: it often preserves marginal distributions but violates inter-column logical constraints, which are crucial for realism. It introduces three metrics ($HCS$, $MDI$, and $DSI$) to quantify hierarchical consistency, multivariate dependencies, and distributional similarity, and evaluates five generation methods on the DataCo industrial dataset against these metrics alongside traditional statistics. The findings show that GReaT and SMOTE most effectively preserve inter-column logic and that TabSyn excels at certain mathematical dependencies, while CTGAN and TabDDPM lag in maintaining coherent cross-column relationships. $DSI$ provides fine-grained insight into distributional alignment via Gaussian Mixture Models. The work contributes a practical evaluation framework and actionable directions for improving logical constraint preservation in synthetic tabular data, with potential impact on real-world deployments that require coherent cross-column structures. The joint distribution $P(X,Y)$ and related conditional dependencies are central to the analysis, underscoring the need to move beyond distributional fidelity alone toward logic-aware generation.

Abstract

Current evaluations of synthetic tabular data mainly focus on how well joint distributions are modeled, often overlooking the assessment of their effectiveness in preserving realistic event sequences and coherent entity relationships across columns. This paper proposes three evaluation metrics designed to assess the preservation of logical relationships among columns in synthetic tabular data. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset. Experimental results reveal that existing methods often fail to rigorously maintain logical consistency (e.g., hierarchical relationships in geography or organization) and dependencies (e.g., temporal sequences or mathematical relationships), which are crucial for preserving the fine-grained realism of real-world tabular data. Building on these insights, this study also discusses possible pathways to better capture logical relationships while modeling the distribution of synthetic tabular data.

Paper Structure

This paper contains 14 sections, 7 equations, 5 tables.