Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

Hengrui Zhang; Liancheng Fang; Qitian Wu; Philip S. Yu

TL;DR

To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features, which enables TabDAR to not only freely handle heterogeneous tabular data but also support convenient and flexible unconditional/conditional sampling.

Abstract

Autoregressive models are predominant in natural language generation, while their application to tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data types, while autoregressive models are primarily designed for discrete-valued data; 2) tabular data is column permutation-invariant, requiring a generative model to generate columns in arbitrary order. This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features. To ensure arbitrary generation order, TabDAR uses masked transformers with bi-directional attention, which simulate various permutations of column order and thereby learn the conditional distribution of a target column given an arbitrary combination of other columns. These designs enable TabDAR not only to handle heterogeneous tabular data freely but also to support convenient and flexible unconditional and conditional sampling. We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.
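To make the arbitrary-order idea concrete, the following is a minimal sketch, not the authors' implementation, of how random masking can simulate different column orders during training and how a generation order can be drawn at sampling time. The function names `sample_training_mask` and `generation_order` are hypothetical, and the transformer and the diffusion model for continuous columns are omitted entirely; only the bookkeeping of which columns are visible and which are predicted is shown.

```python
import random


def sample_training_mask(num_columns, rng):
    """Split column indices into (context, targets) for one training step.

    Assumption for illustration: a random permutation of the columns is
    drawn and cut at a random position. Columns before the cut act as
    visible context; the remaining columns are masked and must be predicted.
    """
    order = list(range(num_columns))
    rng.shuffle(order)
    # cut is in [0, num_columns - 1], so at least one column is a target
    cut = rng.randrange(num_columns)
    return sorted(order[:cut]), sorted(order[cut:])


def generation_order(num_columns, observed, rng):
    """Return a random order in which to generate the unobserved columns.

    An empty `observed` set corresponds to unconditional sampling; a
    non-empty set corresponds to conditional sampling given those columns.
    """
    remaining = [c for c in range(num_columns) if c not in observed]
    rng.shuffle(remaining)
    return remaining


rng = random.Random(0)
context, targets = sample_training_mask(5, rng)
order = generation_order(5, observed={0, 3}, rng=rng)
print("context:", context, "targets:", targets)
print("generation order for unobserved columns:", order)
```

In a full model, each column in `order` would be sampled in turn from its conditional distribution given the columns filled in so far, with a categorical prediction for discrete columns and, as the abstract describes, a diffusion model for continuous ones.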
