A Sobering Look at Tabular Data Generation via Probabilistic Circuits

Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari

Abstract

Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline -- hierarchical mixture models in the form of deep probabilistic circuits (PCs) -- which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at https://github.com/april-tools/tabpc.

Paper Structure (46 sections, 3 theorems, 21 equations, 15 figures, 19 tables, 1 algorithm)

Key Result

Theorem 2.1

Let $\mathcal{D}_R$ with $|\mathcal{D}_R| = n$ be a real dataset, and let $p$ be an FF model trained by MLE on this dataset. Then [...], where $\mathcal{D}_S$ is an i.i.d. sample of $n$ items drawn from $p$.
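
For orientation, a fully-factorised (FF) model of the kind the theorem refers to is commonly written as below. This is the standard textbook form, not a quotation from the paper: the symbols $d$ (number of features), $p_j$ (the $j$-th univariate marginal) and $\theta_j$ (its parameters) are assumed here for illustration.

$$
p(\mathbf{x}) = \prod_{j=1}^{d} p_j(x_j; \theta_j),
\qquad
\hat{\theta}_j = \arg\max_{\theta_j} \sum_{\mathbf{x} \in \mathcal{D}_R} \log p_j(x_j; \theta_j).
$$

Because the log-likelihood is a sum over features, MLE fits each marginal independently, so such a model can match every univariate marginal of $\mathcal{D}_R$ while encoding no dependency between features.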

Figures (15)

  • Figure 1: PCs for tabular data (TabPC) compete with diffusion-based approaches at a fraction of the cost for all datasets (denoted by marker shape, see \ref{fig:random_FF_vs_trained_FF}) on fidelity metrics such as $\alpha$-precision \cite{faithful} and C2ST \cite{lopez-paz2017revisiting} when computed with an XGBoost classifier. We remark that this is not the commonly used implementation of C2ST, which instead uses a logistic regressor for which even a fully-factorised model yields SotA results, delivering a false sense of progress (see \ref{fig:xgb_vs_lr}).
  • Figure 2: wNMIS is invariant to the quality of the univariate marginals while Trend is not: wNMIS assigns almost identical scores to FF models trained via MLE and to FF models with randomly initialised parameters across all datasets (denoted by marker shape), so the points lie on the identity, shown as the grey dashed line. In contrast, Trend reports that a trained FF model captures bivariate dependencies well and that an untrained model does not.
  • Figure 3: C2ST with XGBoost offers a much clearer stratification of model performance than with LR across all datasets, and is able to correctly separate out the performance of the trivial FF model. Details in \ref{sec:trend}.
  • Figure 4: Constructing TabPC requires three choices: the region graph (RG), the level of overparameterisation, and the type of sum-product layers. \ref{fig:sub1} shows a tree-shaped RG over three variables. This acts as a template from which we construct the simple circuit shown in \ref{fig:sub2}. To increase expressivity, we overparameterise the circuit by populating it with $K$ units organised in layers, as seen in \ref{fig:sub3} for $K=3$. Finally, the choice of sum-product layer dictates how we connect units across layers. TabPC uses CP sum-product layers, also shown in \ref{fig:sub3}, which are described in detail below.
  • Figure 5: Validation bits-per-dimension (BPD) provides a strong signal of downstream sample quality as measured by C2ST (XGB), as displayed here for the Adult (left) and Magic (right) datasets (full results in \ref{app:ll_plots}). Each point represents a different hyperparameter configuration for TabPC, across number of sum and input units (colour), batch size (marker size), and learning rate (marker style). The dashed grey line is a Huber regression fit.
  • ...and 10 more figures
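
The captions of Figures 1 and 3 contrast C2ST computed with a logistic regressor against C2ST computed with a more expressive classifier. The toy sketch below illustrates why that choice can matter. It is not the paper's protocol: it uses only the Python standard library, it substitutes a logistic regression with a hand-added interaction feature for XGBoost, it scores the detector by held-out ROC AUC (an assumption), and every function name (`make_real`, `sample_ff`, `c2st_auc`, and so on) is hypothetical. The "synthetic" data come from a fully-factorised Gaussian fitted by MLE to two strongly correlated real columns.

```python
import math
import random


def make_real(n, rho, rng):
    """Draw n rows with two standard-normal columns of correlation rho."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rho * x1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        rows.append((x1, x2))
    return rows


def sample_ff(real, n, rng):
    """Fit one Gaussian per column by MLE, then sample the columns independently."""
    params = []
    for col in zip(*real):
        mu = sum(col) / len(col)
        var = sum((v - mu) ** 2 for v in col) / len(col)
        params.append((mu, math.sqrt(var)))
    return [tuple(rng.gauss(mu, sd) for mu, sd in params) for _ in range(n)]


def linear_features(row):
    """Bias plus raw columns: all a plain logistic regressor gets to see."""
    return [1.0, row[0], row[1]]


def interaction_features(row):
    """Adds the product x1*x2 (a stand-in for a nonlinear classifier)."""
    return [1.0, row[0], row[1], row[0] * row[1]]


def fit_logreg(X, y, steps=200, lr=0.5):
    """Logistic regression fitted by full-batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def roc_auc(scores, labels):
    """Rank-based ROC AUC; tied scores are not averaged."""
    ranked = sorted(zip(scores, labels))
    rank_sum = sum(i + 1 for i, (_, lab) in enumerate(ranked) if lab == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def c2st_auc(real, synth, featurise, rng):
    """Train a real-vs-synthetic detector on half the rows, score the other half."""
    data = [(featurise(r), 1) for r in real] + [(featurise(s), 0) for s in synth]
    rng.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    w = fit_logreg([x for x, _ in train], [y for _, y in train])
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for x, _ in test]
    return roc_auc(scores, [y for _, y in test])


rng = random.Random(0)
real = make_real(1000, rho=0.9, rng=rng)
synth = sample_ff(real, 1000, rng)
auc_linear = c2st_auc(real, synth, linear_features, rng)
auc_interaction = c2st_auc(real, synth, interaction_features, rng)
print(f"linear AUC: {auc_linear:.3f}, with interaction: {auc_interaction:.3f}")
```

On this toy data the linear detector should stay close to chance (AUC near 0.5), because real and synthetic rows share the same marginals and differ only in their correlation, whereas the detector with the interaction feature should score clearly above chance. An AUC near 0.5 therefore only means that this particular classifier could not tell the two samples apart, not that the synthetic data are faithful.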

Theorems & Definitions (5)

  • Theorem 2.1
  • Lemma A.1
  • proof
  • Theorem A.2
  • proof