TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

Jacob Si, Zijing Ou, Mike Qu, Zhengrui Xiang, Yingzhen Li

TL;DR

Tabular diffusion models struggle with mixed feature modalities when using separate representations or sparse encodings. TabRep introduces a unified continuous representation, CatConverter, mapping categorical features to a dense 2D phase space and integrating seamlessly with DDPM or Flow Matching, supported by geometric insights into data manifolds. Empirical results across seven datasets show TabRep yields superior downstream quality, robust privacy preservation, and faster training/sampling compared to strong baselines, including scenarios with high-cardinality and ordinal features. The work demonstrates that a simple, density-aware representation can unlock high-fidelity, privacy-preserving tabular data synthesis with diffusion models and points to future explorations of reverse continuous-to-categorical transitions and time-series tokenization.

Abstract

Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently leverages sparse, suboptimal encoding heuristics and incurs additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are its density, its flexibility to provide ample separability for nominal features, and its ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results show that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient. Code is available at https://github.com/jacobyhsi/TabRep.

Paper Structure

This paper contains 28 sections, 2 theorems, 59 equations, 9 figures, 16 tables, 2 algorithms.

Key Result

Theorem 4.1

Assume $x$ is a noisy observation from a Gaussian centered at a weighted one-hot vector $\alpha_t e_k \in \mathbb{R}^{K}$. We can define the forward diffusion process as: $p_t(x|e_k)=\mathcal{N}(x|\alpha_te_k,\sigma_t^2I)$. We derive the variance of the conditional score function evaluated at a mini
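
The forward process in the theorem statement above can be illustrated with a small sketch. It draws $x \sim \mathcal{N}(\alpha_t e_k, \sigma_t^2 I)$ and evaluates the conditional score $\nabla_x \log p_t(x|e_k) = -(x - \alpha_t e_k)/\sigma_t^2$, which is the standard score of an isotropic Gaussian. The function names and the values of $\alpha_t$ and $\sigma_t$ are illustrative assumptions, not taken from the paper, and the variance result itself is not reproduced here because the statement is truncated.

```python
import random


def one_hot(k, num_categories):
    """Return the one-hot vector e_k in R^K as a list of floats."""
    return [1.0 if i == k else 0.0 for i in range(num_categories)]


def forward_sample(k, num_categories, alpha_t, sigma_t, rng):
    """Draw x ~ N(alpha_t * e_k, sigma_t^2 I), one coordinate at a time."""
    e_k = one_hot(k, num_categories)
    return [alpha_t * e + sigma_t * rng.gauss(0.0, 1.0) for e in e_k]


def conditional_score(x, k, alpha_t, sigma_t):
    """Gradient of log N(x | alpha_t e_k, sigma_t^2 I) with respect to x."""
    e_k = one_hot(k, len(x))
    return [-(xi - alpha_t * e) / sigma_t**2 for xi, e in zip(x, e_k)]


# Illustrative values only (assumed, not from the paper).
rng = random.Random(0)
x = forward_sample(k=1, num_categories=3, alpha_t=0.8, sigma_t=0.5, rng=rng)
score = conditional_score(x, k=1, alpha_t=0.8, sigma_t=0.5)
```

The score points from the noisy sample back toward the scaled one-hot mean and vanishes exactly at $x = \alpha_t e_k$.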

Figures (9)

  • Figure 1: The TabRep Architecture. TabRep transforms and unifies the data space under a continuous regime via our representation. A diffusion or flow matching process is trained to optimize the denoising network. Once training is completed, samples can be generated through a reverse denoising process before inverse transforming back into their original data representation.
  • Figure 1: Categorical Representation Dimensions.
  • Figure 2: Singular Regions in a 3D One-Hot Setting.
  • Figure 3: Separability of CatConverter. CatConverter preserves nominal features for up to $128$ categories.
  • Figure 4: Training TabRep-DDPM/Flow
  • ...and 4 more figures
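
To make the idea of a dense 2D phase space concrete, here is a minimal sketch of one way a categorical value could be placed on the unit circle and recovered again. The angle formula $\theta = 2\pi k / K$, the nearest-angle decoding, and the names `cat_to_phase` and `phase_to_cat` are assumptions made for illustration; this page does not give the exact CatConverter definition, so the authors' implementation may differ.

```python
import math


def cat_to_phase(k, num_categories):
    """Map category index k in [0, K) to a point on the unit circle.

    Assumed encoding: angle theta = 2*pi*k/K, output (cos theta, sin theta).
    """
    theta = 2.0 * math.pi * k / num_categories
    return (math.cos(theta), math.sin(theta))


def phase_to_cat(point, num_categories):
    """Map a (possibly noisy) 2D point back to the nearest category index."""
    x, y = point
    # atan2 returns an angle in (-pi, pi]; wrap it into [0, 2*pi).
    theta = math.atan2(y, x) % (2.0 * math.pi)
    # Round to the nearest of the K evenly spaced angles, wrapping K back to 0.
    return round(theta * num_categories / (2.0 * math.pi)) % num_categories


# Round trip over 128 categories, the count mentioned in the Figure 3 caption.
recovered = [phase_to_cat(cat_to_phase(k, 128), 128) for k in range(128)]
```

Under this assumed encoding every feature occupies two dimensions regardless of its cardinality, in contrast to a one-hot vector whose length grows with the number of categories.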

Theorems & Definitions (5)

  • Definition 4.1: $n$-singular point
  • Definition 4.2: $n$-singular hyperplane
  • Theorem 4.1: Variance of Conditional Score Function
  • Theorem A.1: Variance of Conditional Score Function
  • proof