Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, George Karypis

TL;DR

TabSyn tackles the challenge of synthesizing mixed-type tabular data by mapping heterogeneous columns into a continuous latent space with a Transformer-based VAE and then learning a score-based diffusion model in that latent space. The approach enables joint modeling of inter-column dependencies and mixed data types while delivering fast sampling through a linear noise schedule and fewer than 20 reverse steps. Extensive experiments on six real-world datasets show TabSyn consistently outperforms seven baselines across low- and high-order metrics and demonstrates strong downstream utility in ML efficiency and missing-value imputation. The work advances tabular data generation by combining unified latent representations with efficient latent diffusion, offering a generally applicable, high-quality, and fast synthetic data generator for practical use.

Abstract

Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging because tabular data combine intricately varied distributions with a mix of data types. This paper introduces TabSyn, a methodology that synthesizes tabular data by leveraging a diffusion model within a latent space crafted by a variational autoencoder (VAE). The key advantages of TabSyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and to explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; and (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that TabSyn outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations, respectively, compared with the most competitive baselines.

Paper Structure

This paper contains 56 sections, 2 theorems, 38 equations, 10 figures, 13 tables, 4 algorithms.

Key Result

Proposition 1

Consider the reverse diffusion process (the paper's reverse-process equation) from ${\bm{z}}_{t_b}$ to ${\bm{z}}_{t_a}$, with $t_b > t_a$. The numerical solution $\hat{{\bm{z}}}_{t_a}$ has the smallest approximation error to ${\bm{z}}_{t_a}$ when $\sigma(t) = t$.
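To make the role of $\sigma(t) = t$ concrete, here is a minimal sketch of an Euler solver under that linear noise level. It is an illustration under stated assumptions, not the paper's sampler. It integrates the deterministic probability-flow ODE $d z / dt = -\dot\sigma(t)\,\sigma(t)\,\nabla_z \log p_t(z)$ rather than the reverse SDE. It replaces the learned denoiser with the analytic score of a toy 1-D Gaussian, and it uses a uniform time grid. The names `score`, `euler_reverse`, and `exact_reverse` are hypothetical.

```python
import math

def score(z, t, mu, s):
    # Toy assumption: data ~ N(mu, s^2), so p_t = N(mu, s^2 + sigma(t)^2)
    # with sigma(t) = t, and the score is available in closed form.
    return -(z - mu) / (s * s + t * t)

def euler_reverse(z_T, mu, s, T, n_steps):
    """Euler integration of dz/dt = -sigma'(t) * sigma(t) * score
    from t = T down to t = 0, with sigma(t) = t (so sigma'(t) = 1)."""
    z = z_T
    for i in range(n_steps):
        t_cur = T * (1 - i / n_steps)         # uniform grid (an assumption)
        t_next = T * (1 - (i + 1) / n_steps)
        drift = -t_cur * score(z, t_cur, mu, s)
        z = z + (t_next - t_cur) * drift      # one Euler step backward in time
    return z

def exact_reverse(z_T, mu, s, T):
    # Closed-form solution of the same ODE for the toy Gaussian:
    # (z_0 - mu) = (z_T - mu) * s / sqrt(s^2 + T^2)
    return mu + (z_T - mu) * s / math.sqrt(s * s + T * T)

if __name__ == "__main__":
    z_T, mu, s, T = 7.0, 2.0, 1.0, 10.0
    for n in (20, 200, 2000):
        err = abs(euler_reverse(z_T, mu, s, T, n) - exact_reverse(z_T, mu, s, T))
        print(f"steps={n:5d}  abs error={err:.6f}")
```

Because the toy problem has a closed-form solution, the printed error measures the discretization error of the solver directly. A practical sampler would swap `score` for a trained network and would likely use a non-uniform time grid.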

Figures (10)

  • Figure 1: Our TabSyn consistently outperforms SOTA tabular data generation methods across five data quality metrics.
  • Figure 2: An overview of the proposed TabSyn. Each row data $x$ is mapped to latent space $z$ via a column-wise tokenizer and an encoder. A diffusion process $z_0 \rightarrow z_T$ is applied in the latent space. Synthesis $z_T \rightarrow z_0$ starts from the base distribution $p(z_T)$ and generates samples $z_0$ in latent space through a reverse process. These samples are then mapped from latent $z$ to data space $\tilde{x}$ using a decoder and a detokenizer.
  • Figure 3: The trends of the validation reconstruction (left) and KL-divergence (right) losses on the Adult dataset, with varying constant $\beta$, and our proposed scheduled $\beta$ ($\beta_{\rm max} = 0.01, \beta_{\rm min} = 10^{-5}, \lambda = 0.7$). The proposed scheduled $\beta$ obtains the lowest reconstruction loss with a fairly low KL-divergence loss.
  • Figure 4: Quality of synthetic data as a function of the number of function evaluations (NFEs) for STaSy, TabDDPM, and TabSyn. TabSyn generates the best-quality synthetic data with fewer NFEs (indicating faster sampling).
  • Figure 5: Visualization of the single-column distribution density of synthetic data (from STaSy, TabDDPM, and TabSyn) vs. the real data. Upper: numerical columns; lower: categorical columns. On numerical columns TabSyn is competitive with the baselines, while it excels at estimating categorical column distributions.
  • ...and 5 more figures
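
The scheduled $\beta$ in the Figure 3 caption is described only by its parameters ($\beta_{\rm max} = 0.01$, $\beta_{\rm min} = 10^{-5}$, $\lambda = 0.7$). The sketch below shows one way such a schedule could work. It assumes that $\beta$ starts at $\beta_{\rm max}$ and is multiplied by $\lambda$ whenever the reconstruction loss stops improving, and that it never drops below $\beta_{\rm min}$. The plateau trigger, the `patience` parameter, and the function name `scheduled_beta` are assumptions made for illustration and are not taken from the caption.

```python
def scheduled_beta(recon_losses, beta_max=0.01, beta_min=1e-5, lam=0.7, patience=2):
    """Return the KL weight beta used at each epoch.

    recon_losses: per-epoch reconstruction losses (illustrative input).
    patience: hypothetical number of non-improving epochs before a decay.
    """
    beta = beta_max
    best = float("inf")
    stale = 0
    betas = []
    for loss in recon_losses:
        betas.append(beta)            # beta in effect for this epoch
        if loss < best:
            best = loss               # reconstruction improved: reset counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:     # plateau: decay beta, floored at beta_min
                beta = max(beta * lam, beta_min)
                stale = 0
    return betas

if __name__ == "__main__":
    losses = [1.0, 0.9, 0.95, 0.92, 0.8, 0.85, 0.85]
    for epoch, b in enumerate(scheduled_beta(losses)):
        print(epoch, b)
```

Under this assumed rule, $\beta$ shrinks only when reconstruction stalls. The KL term therefore regularizes the latent space strongly early on and gradually yields to reconstruction quality.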

Theorems & Definitions (2)

  • Proposition 1
  • Lemma 1