XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost

Jim Achterberg, Marcel Haas, Bram van Dijk, Marco Spruit

TL;DR

XGenBoost is a pair of generative models based on XGBoost: a Denoising Diffusion Implicit Model (DDIM) with XGBoost as score estimator, suited for smaller datasets, and a hierarchical autoregressive model whose conditionals are learned via XGBoost classifiers, suited for large-scale tabular synthesis.

Abstract

Tree ensembles such as XGBoost are often preferred for discriminative tasks on mixed-type tabular data because of their inductive biases, minimal hyperparameter tuning, and training efficiency. We argue that these qualities, when leveraged correctly, can make for better generative models as well. We therefore present XGenBoost, a pair of generative models based on XGBoost: i) a Denoising Diffusion Implicit Model (DDIM) with XGBoost as score estimator, suited for smaller datasets, and ii) a hierarchical autoregressive model whose conditionals are learned via XGBoost classifiers, suited for large-scale tabular synthesis. The architectures follow from the natural constraints imposed by tree-based learners. In the diffusion model, we combine Gaussian and multinomial diffusion to leverage native categorical splits and avoid one-hot encoding while accurately modelling mixed data types. In the autoregressive model, we use a fixed-order factorization, a hierarchical classifier to impose ordinal inductive biases when modelling numerical features, and de-quantization based on empirical quantile functions to model the non-continuous nature of most real-world tabular datasets. On two benchmarks, one containing smaller and the other larger datasets, we show that our proposed architectures outperform previous neural- and tree-based generative models for mixed-type tabular synthesis at lower training cost.
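
The de-quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the binning rule are assumptions made for illustration, and the bin indices that a trained XGBoost classifier would predict are replaced here by the bins of the training values themselves. The sketch contrasts uniform sampling within a bin with sampling from the bin's empirical quantile function (EQF), which only ever returns values observed in training and so preserves the discrete, non-continuous nature of a column.

```python
import bisect
import random

def quantile_edges(values, n_bins):
    """Bin edges at evenly spaced empirical quantiles, duplicates removed."""
    xs = sorted(values)
    n = len(xs)
    edges = [xs[min(n - 1, (k * n) // n_bins)] for k in range(n_bins)] + [xs[-1]]
    edges = sorted(set(edges))
    if len(edges) == 1:          # constant column: keep one degenerate bin
        edges = edges * 2
    return edges

def assign_bin(x, edges):
    """Index i of the bin [edges[i], edges[i+1]); the last bin is closed."""
    i = bisect.bisect_right(edges, x) - 1
    return max(0, min(i, len(edges) - 2))

def fit_eqf(values, edges):
    """Store the sorted training values that fall into each bin."""
    per_bin = [[] for _ in range(len(edges) - 1)]
    for x in values:
        per_bin[assign_bin(x, edges)].append(x)
    return [sorted(b) for b in per_bin]

def dequantize_eqf(bin_idx, per_bin, rng):
    """Draw u ~ U(0, 1) and return the matching order statistic of the bin."""
    xs = per_bin[bin_idx]
    u = rng.random()
    return xs[min(len(xs) - 1, int(u * len(xs)))]

def dequantize_uniform(bin_idx, edges, rng):
    """Draw a value uniformly between the two edges of the bin."""
    return rng.uniform(edges[bin_idx], edges[bin_idx + 1])

# Toy integer-valued column (hypothetical data).
values = [0, 0, 0, 1, 1, 2, 5, 5, 10, 40]
edges = quantile_edges(values, n_bins=4)
per_bin = fit_eqf(values, edges)

# Stand-in for bin indices predicted by a model: the bins of the training values.
bins = [assign_bin(x, edges) for x in values]

rng = random.Random(0)
eqf_samples = [dequantize_eqf(b, per_bin, rng) for b in bins]
uniform_samples = [dequantize_uniform(b, edges, rng) for b in bins]

print("edges:", edges)
print("EQF samples:", eqf_samples)
print("uniform samples:", [round(s, 2) for s in uniform_samples])
```

In this sketch every EQF sample is one of the original integer values, whereas uniform sampling produces arbitrary real numbers inside each bin.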

Paper Structure (52 sections, 15 equations, 3 figures, 30 tables)

Figures (3)

  • Figure 1: Training times in minutes of XGenB-DF (v-DDIM specification) and XGenB-AR in the Small and Big Benchmark, respectively.
  • Figure 2: Violin plot of metric scores for various levels of dropout in XGenB-DF, evaluated for 20 sampling seeds per dataset in the Small Benchmark. MLE uses ROC AUC and $R^2$ for classification and regression datasets, respectively.
  • Figure 3: Violin plot of metric scores for various quantization strategies (Quantile, Uniform, KMeans) and de-quantization strategies (Uniform-sampling, EQF-sampling) in XGenBoost's autoregressive model, evaluated for 20 sampling seeds per dataset in the Big Benchmark. MLE uses ROC AUC and $R^2$ for classification and regression datasets, respectively.