TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation
Jacob Si, Zijing Ou, Mike Qu, Zhengrui Xiang, Yingzhen Li
TL;DR
Tabular diffusion models struggle with mixed feature modalities when using separate representations or sparse encodings. TabRep introduces a unified continuous representation, CatConverter, mapping categorical features to a dense 2D phase space and integrating seamlessly with DDPM or Flow Matching, supported by geometric insights into data manifolds. Empirical results across seven datasets show TabRep yields superior downstream quality, robust privacy preservation, and faster training/sampling compared to strong baselines, including scenarios with high-cardinality and ordinal features. The work demonstrates that a simple, density-aware representation can unlock high-fidelity, privacy-preserving tabular data synthesis with diffusion models and points to future explorations of reverse continuous-to-categorical transitions and time-series tokenization.
Abstract
Diffusion models have been the predominant generative models for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently relies on sparse, suboptimal encoding heuristics and incurs additional computational cost. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are its density, its flexibility to provide ample separability for nominal features, and its ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results show that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient. Code is available at https://github.com/jacobyhsi/TabRep.
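To make the idea of a dense 2D phase representation concrete, here is a minimal sketch. The text above does not spell out the exact mapping, so this assumes that category index k out of K is placed at angle 2πk/K on the unit circle and recovered by snapping to the nearest angle. The function names `encode_phase` and `decode_phase` are hypothetical and are not taken from the TabRep repository.

```python
import math


def encode_phase(index, num_categories):
    """Map a category index to a point on the unit circle (hypothetical helper).

    Assumption: category k of K sits at angle 2*pi*k/K.
    """
    angle = 2.0 * math.pi * index / num_categories
    return (math.cos(angle), math.sin(angle))


def decode_phase(point, num_categories):
    """Recover the nearest category index from a (possibly noisy) 2D point."""
    x, y = point
    angle = math.atan2(y, x) % (2.0 * math.pi)  # wrap angle into [0, 2*pi)
    # Snap to the closest of the K evenly spaced phases; wrap K back to 0.
    return round(angle * num_categories / (2.0 * math.pi)) % num_categories


# Round trip for a feature with 5 categories.
codes = [encode_phase(k, 5) for k in range(5)]
decoded = [decode_phase(c, 5) for c in codes]
```

Under this assumption, every categorical feature occupies two continuous dimensions regardless of its cardinality, whereas a one-hot encoding would need K mostly-zero dimensions. A slightly perturbed point still decodes to the nearest category, which is what a generated sample would need in order to be mapped back to a discrete value.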
