
TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation

Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, Jure Leskovec

TL;DR

TabDiff addresses the generation of high-quality tabular data with mixed numerical and categorical features by proposing a unified, continuous-time diffusion model. It introduces feature-wise learnable noise schedules, a transformer-based denoiser, a backward stochastic sampler, and classifier-free guidance (CFG) for conditional generation such as missing-value imputation. The framework directly models the joint data distribution in the original space and achieves superior average performance over competitive baselines across seven real-world datasets and eight evaluation metrics, including up to a 22.5% improvement on pair-wise column correlation estimation. The combination of adaptive noise schedules, stochastic sampling, and CFG makes TabDiff suitable for data synthesis, privacy-preserving augmentation, and downstream analysis.

Abstract

Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a mixed-type stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to $22.5\%$ improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.
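The abstract mentions classifier-free guidance for imputing missing column values. As a rough illustration only, the sketch below shows the commonly used linear combination of a conditional and an unconditional model output. The function name `cfg_combine` and the guidance weight `w` are hypothetical, and the exact formulation used by TabDiff may differ.

```python
def cfg_combine(cond, uncond, w):
    """Generic classifier-free guidance: extrapolate from the unconditional
    output toward the conditional one.

    cond, uncond: equal-length lists of model outputs (e.g. scores or logits)
    w: guidance weight (assumed name); w = 0 returns the conditional output
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


# Hypothetical outputs for one target column, with and without
# conditioning on the observed columns.
guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 1.0], w=1.0)
```

With `w = 0` the guided output equals the conditional output; larger `w` pushes the result further away from the unconditional prediction.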

Paper Structure

This paper contains 33 sections, 28 equations, 17 figures, 14 tables, and 2 algorithms.

Figures (17)

  • Figure 1: A high-level overview of TabDiff. TabDiff operates by normalizing numerical columns and converting categorical columns into one-hot vectors with an extra [MASK] class. Joint forward diffusion processes are applied to all modalities, with each column's noise rate controlled by learnable schedules. New samples are generated via the reverse process, in which the denoising network gradually denoises ${\mathbf{x}}_1$ into ${\mathbf{x}}_0$, and the inverse transform is then applied to recover the original format.
  • Figure 2: The adaptively learnable noise schedules reduce training loss.
  • Figure 3: Visualization of the marginal densities of the generated data in comparison to the real data. Top and Middle: individual numerical column; Bottom: individual categorical column.
  • Figure 4: Pair-wise correlation heatmaps. Values represent the error rate (the lighter, the better).
  • Figure 5: Ablation Studies on the stochastic sampler and learnable noise schedules.
  • ...and 12 more figures
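
The preprocessing and forward corruption described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the schedule forms `sigma(t) = t**rho` and `alpha(t) = 1 - t**k`, the parameter names `rho` and `k`, and all function names are hypothetical stand-ins for the per-column learnable schedules.

```python
import math
import random

MASK = "[MASK]"  # extra category appended to every categorical column


def standardize(column):
    """Z-score a numerical column; return values plus (mean, std) for inversion."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant columns
    return [(x - mean) / std for x in column], (mean, std)


def one_hot_with_mask(value, categories):
    """One-hot encode `value` over `categories` plus a trailing [MASK] slot."""
    vocab = list(categories) + [MASK]
    return [1.0 if v == value else 0.0 for v in vocab]


def noise_numerical(x0, t, rho, rng):
    """Gaussian corruption x_t = x_0 + sigma(t) * eps, with assumed sigma(t) = t**rho."""
    sigma = t ** rho
    return x0 + sigma * rng.gauss(0.0, 1.0)


def noise_categorical(value, t, k, rng):
    """Masking corruption: keep `value` with probability alpha(t) = 1 - t**k
    (assumed form), otherwise replace it with [MASK]."""
    alpha = 1.0 - t ** k
    return value if rng.random() < alpha else MASK


rng = random.Random(0)
ages, age_stats = standardize([23.0, 35.0, 47.0])
# Corrupt one row at t = 0.5; each column uses its own schedule parameter.
row_t = (
    noise_numerical(ages[0], 0.5, rho=2.0, rng=rng),
    noise_categorical("red", 0.5, k=1.0, rng=rng),
)
```

In this sketch each column carries its own schedule parameter, which is the role the caption assigns to the learnable schedules; at `t = 0` a row is unchanged, and at `t = 1` every categorical entry is `[MASK]`.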