Beyond the convexity assumption: Realistic tabular data generation under quantifier-free real linear constraints

Mihaela Cătălina Stoian, Eleonora Giunchiglia

TL;DR

This paper introduces the Disjunctive Refinement Layer (DRL), a novel layer that enforces the alignment of generated data with the background knowledge specified in user-defined constraints. DRL is the first method able to automatically make deep learning models inherently compliant with constraints as expressive as quantifier-free linear formulas.

Abstract

Synthetic tabular data generation has traditionally been a challenging problem due to the high complexity of the underlying distributions that characterise this type of data. Despite recent advances in deep generative models (DGMs), existing methods often fail to produce realistic datapoints that are well-aligned with available background knowledge. In this paper, we address this limitation by introducing Disjunctive Refinement Layer (DRL), a novel layer designed to enforce the alignment of generated data with the background knowledge specified in user-defined constraints. DRL is the first method able to automatically make deep learning models inherently compliant with constraints as expressive as quantifier-free linear formulas, which can define non-convex and even disconnected spaces. Our experimental analysis shows that DRL not only guarantees constraint satisfaction but also improves efficacy in downstream tasks. Notably, when applied to DGMs that frequently violate constraints, DRL eliminates violations entirely. Further, it improves performance metrics by up to 21.4% in F1-score and 20.9% in Area Under the ROC Curve, thus demonstrating its practical impact on data generation.

Paper Structure

This paper contains 40 sections, 11 theorems, 16 equations, 9 figures, 25 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $\Pi$ be a finite and satisfiable set of constraints in a single variable $x_i$. For every sample $\tilde{x}$, $\text{DRL}(\tilde{x})$ satisfies $\Pi$ and is optimal w.r.t. $\tilde{x}$.
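
To make the single-variable case concrete, the sketch below illustrates the kind of refinement the lemma describes. It is not the paper's algorithm. It assumes that the feasible set of $\Pi$ has already been written as a finite union of closed intervals, and that "optimal w.r.t. $\tilde{x}$" means minimal absolute distance to $\tilde{x}$. The name `refine_1d` and the interval representation are hypothetical, and strict inequalities (open endpoints) are not handled.

```python
import math
from typing import List, Tuple

# Assumption: a closed interval [lo, hi]; use -math.inf / math.inf for unbounded ends.
Interval = Tuple[float, float]


def refine_1d(x: float, intervals: List[Interval]) -> float:
    """Return the point in the union of `intervals` closest to `x`.

    Hypothetical helper: a disjunction of one-variable linear constraints
    is modelled here as a (possibly disconnected) union of closed intervals.
    """
    if not intervals:
        raise ValueError("the constraint set must be satisfiable")
    best, best_dist = None, math.inf
    for lo, hi in intervals:
        if lo > hi:
            raise ValueError(f"empty interval: [{lo}, {hi}]")
        candidate = min(max(x, lo), hi)  # clamp x into [lo, hi]
        dist = abs(candidate - x)
        if dist < best_dist:  # strict '<' keeps the earliest interval on ties
            best, best_dist = candidate, dist
    return best


# Example constraint: (x <= 0) OR (2 <= x <= 5), a disconnected feasible set.
feasible = [(-math.inf, 0.0), (2.0, 5.0)]
print(refine_1d(-3.0, feasible))  # already feasible, returned unchanged
print(refine_1d(1.7, feasible))   # moved to the nearest boundary of [2, 5]
```

A sample that already satisfies the constraint is left untouched, and an infeasible sample moves only as far as the nearest boundary. Clamping to a single interval could not express the gap between 0 and 2, which is why the disjunction is handled interval by interval.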

Figures (9)

  • Figure 1:
  • Figure 2:
  • Figure 4: Visualisation of left and right boundaries defined by constraint $\Psi$. The green regions correspond to the values of $x_i \in \Omega(\Psi)$.
  • Figure 5: Constraints for Ex. \ref{ex:one_dim}.
  • Figure 6: Sample distributions for real and synthetic data from TVAE, TVAE+LL and TVAE+DRL. The regions where samples violate the constraints are in red.
  • ...and 4 more figures

Theorems & Definitions (21)

  • Example 1
  • Lemma 3.1
  • Example 2
  • Lemma 3.2
  • Example 3: Example \ref{ex:intro}, cont'd
  • Example 4: Example \ref{ex:valid_intervals}, cont'd
  • Lemma 3.3
  • Corollary 3.4
  • Example 5: Examples \ref{ex:one_dim}, \ref{ex:pis_computation}, cont'd
  • Theorem 3.5
  • ...and 11 more