Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip S. Yu

TL;DR

To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features, which enables TabDAR to not only freely handle heterogeneous tabular data but also support convenient and flexible unconditional/conditional sampling.

Abstract

Autoregressive models are predominant in natural language generation, while their application to tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data types, while autoregressive models are primarily designed to model discrete-valued data; 2) tabular data is column-permutation-invariant, requiring a generative model to generate columns in arbitrary order. This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features. To support arbitrary generation order, TabDAR uses masked transformers with bidirectional attention, which simulate various permutations of column order and thereby learn the conditional distribution of a target column given an arbitrary combination of other columns. These designs enable TabDAR not only to handle heterogeneous tabular data freely but also to support convenient and flexible unconditional/conditional sampling. We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.
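
The masking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `sample_training_mask` and `loss_per_column`, the uniform choice of how many columns to mask, and the string labels for the losses are all assumptions made for illustration. The sketch samples a random column order, treats the trailing columns of that order as masked targets, and reports which loss type would apply to each masked column (diffusion for continuous, cross-entropy for discrete).

```python
import random


def sample_training_mask(num_columns, rng):
    """Sample a random column order and mask its trailing columns.

    Returns a list of booleans, one per column (True = masked target).
    The uniform split point is an assumption, not taken from the paper.
    """
    order = list(range(num_columns))
    rng.shuffle(order)                         # random permutation of columns
    num_targets = rng.randint(1, num_columns)  # mask at least one column
    targets = set(order[-num_targets:])        # trailing columns become targets
    return [j in targets for j in range(num_columns)]


def loss_per_column(is_continuous, mask):
    """Pick a loss label for each masked column; unmasked columns get None."""
    kinds = []
    for continuous, masked in zip(is_continuous, mask):
        if not masked:
            kinds.append(None)             # no loss on observed columns
        elif continuous:
            kinds.append("diffusion")      # continuous target column
        else:
            kinds.append("cross_entropy")  # discrete target column
    return kinds


rng = random.Random(0)
is_continuous = [True, False, True, False]  # hypothetical column types
mask = sample_training_mask(len(is_continuous), rng)
kinds = loss_per_column(is_continuous, mask)
print(mask, kinds)
```

Because the observed set changes on every draw, repeated sampling exposes a model to many different "context, target" splits, which is the sense in which masking simulates different column orders.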

Paper Structure

This paper contains 57 sections, 29 equations, 16 figures, 15 tables, and 5 algorithms.

Figures (16)

  • Figure 1: Challenges in Auto-Regressive tabular data generation. (a) The conditional distribution of continuous columns is hard to express. (b) Tabular data is column-permutation-invariant.
  • Figure 2: With appropriate masking, bidirectional attention is equivalent to causal attention in arbitrary order.
  • Figure 3: The overall framework of TabDAR. An embedding layer first encodes each column into a vector. Masks are then applied to the target columns ('age' and 'education'). After decoding by the bidirectional Transformer, the output vectors ${\bm{z}}$ are used as conditions for predicting the distributions of the target columns. TabDAR uses the conditional diffusion loss for continuous columns and the cross-entropy loss for discrete columns. Losses are computed only on the masked tokens.
  • Figure 3: Probability that a synthetic example's DCR to the training set rather than that of the holdout set; a score closer to $50\%$ is better.
  • Figure 4: An illustration of TabDAR's generation process. Given a random generation order, e.g., 'capital gain' $\rightarrow$ 'education' $\rightarrow \cdots$ 'income', TabDAR generates the value for each column in a row according to the conditional distribution learned by the masked Transformers.
  • ...and 11 more figures
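
The generation process described in the Figure 4 caption can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the paper's sampler: `generate_row` and `toy_sampler` are hypothetical names, and `toy_sampler` ignores its context and draws from fixed distributions, standing in for the learned conditional distributions (a diffusion sampler for continuous columns, a categorical one for discrete columns). Passing `given` fixes some columns up front, which mimics conditional sampling.

```python
import random

# Column names taken from the figure captions; the types are assumptions.
COLUMNS = ["age", "education", "capital gain", "income"]
CONTINUOUS = {"age", "capital gain"}


def toy_sampler(name, observed, rng):
    """Hypothetical stand-in for a learned conditional sampler.

    A real model would condition on `observed`; this toy version does not.
    """
    if name in CONTINUOUS:
        return rng.gauss(0.0, 1.0)  # placeholder for a diffusion sampler
    return rng.choice(["a", "b"])   # placeholder for a categorical sampler


def generate_row(columns, sample_column, rng, given=None):
    """Fill in one row column by column in a random order.

    Columns listed in `given` are kept fixed (conditional sampling);
    with `given=None` every column is generated (unconditional sampling).
    """
    row = dict(given or {})
    order = [c for c in columns if c not in row]
    rng.shuffle(order)  # random generation order
    for name in order:
        # Each column is sampled given everything filled in so far.
        row[name] = sample_column(name, dict(row), rng)
    return row, order


rng = random.Random(0)
row, order = generate_row(COLUMNS, toy_sampler, rng, given={"income": "b"})
print(order, row)
```

The same loop covers both sampling modes: the only difference between unconditional and conditional generation is which columns are already present in `row` before the loop starts.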