Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman

TL;DR

The paper tackles missing data and limited sample sizes in tabular datasets by introducing Forest-Diffusion and Forest-Flow, diffusion- and flow-based generative/imputation models that replace neural score/vector-field estimators with XGBoost. It builds an extensive benchmark across 27 real-world datasets and 9 metrics, showing that the proposed tree-based approach can outperform deep-learning generative baselines for data generation and remains competitive for imputation, while enabling CPU-only training. The key contributions are the first diffusion/flow methods for tabular data using Gradient-Boosted Trees, a scalable training recipe that duplicates data to estimate expectations, and comprehensive ablations demonstrating the practical trade-offs of noise-level modeling and conditioning. The work highlights the practical impact of using tree-based methods for mixed-type tabular data, enabling realistic synthetic data generation and robust imputations without specialized hardware, though it notes that traditional imputation methods may still excel in some settings and outlines avenues for future improvements.

Abstract

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.

Paper Structure (71 sections, 1 theorem, 8 equations, 47 figures, 10 tables, 7 algorithms)

Key Result

Theorem 1

The unique vector field whose integration map satisfies $\rho_t(x) = \nu_t + \sigma_t x$ has the form
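The statement above is cut off before its formula. As a hedged reconstruction rather than a quotation, the form follows from differentiating the flow map with respect to time and substituting $y = \rho_t(x)$. Primes denote time derivatives, and $\sigma_t > 0$ is assumed so that the map is invertible. The symbol $u_t$ for the vector field is our own label, and the paper's exact notation may differ.

```latex
\begin{align}
\frac{d}{dt}\rho_t(x) &= \nu_t' + \sigma_t'\, x = u_t\bigl(\rho_t(x)\bigr), \\
x &= \frac{y - \nu_t}{\sigma_t} \quad \text{for } y = \rho_t(x), \\
u_t(y) &= \frac{\sigma_t'}{\sigma_t}\,\bigl(y - \nu_t\bigr) + \nu_t' .
\end{align}
```

This agrees with the form given in Theorem 3 of Lipman et al. (2022), which the theorem list cites as the source.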

Figures (47)

  • Figure 1: Iris dataset: three-way interaction between petal length, petal width, and species, shown for real samples and for fake samples generated by Forest-Flow (our XGBoost method) or by TabDDPM and STaSy (deep-learning diffusion methods)
  • Figure 2: Illustration of our Forest-Flow method, based on I-CFM (Tong et al., 2023); see the appendix for more details. The first step duplicates the original dataset. The second step adds different noise to each duplicated dataset. The third step computes the linear interpolation between each duplicated dataset and its corresponding noise at each time $t$ (i.e., $\mathcal{X}_i(t) = t\mathcal{X} + (1-t) \mathcal{Z}_i$, $\forall i \in [1, \ldots, n_{noise}]$ and $\forall t \in t_{\text{levels}}$). The final step regresses a GBT model at each noise level against the vector field; the training of the $n_t$ models is parallelized over CPUs.
  • Figure 3: Our method learns a different GBT model (with 100 trees) for each noise level (here $n_t = 4$). We can reinterpret this as a single model with giant trees in which the splits on the time variable are hard-coded.
  • Figure 4: Average feature importance (SHAP value) across all XGBoost models trained with Forest-Flow on the Iris dataset. The features (from 0 to 4) are, respectively, sepal length, sepal width, petal length, petal width, and species. Instead of averaging, one could also produce a separate plot per noise level or per predicted variable.
  • ...and 42 more figures
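
The first three steps in the Figure 2 caption (duplicate, add noise, interpolate) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_flow_training_set`, the standard Gaussian noise, and the evenly spaced time grid on [0, 1] are all our own choices. The regression target is the time derivative of the interpolation, $\mathcal{X} - \mathcal{Z}_i$. The final step, fitting one GBT regressor per noise level on each `(inputs, targets)` pair, is left out.

```python
import numpy as np


def build_flow_training_set(X, n_noise=3, n_t=4, seed=0):
    """Return {t: (inputs, targets)} for each time level t (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Step 1: duplicate the dataset n_noise times by stacking its rows.
    X_dup = np.tile(X, (n_noise, 1))  # shape: (n_noise * n_rows, n_features)

    # Step 2: draw independent noise for every duplicated row
    # (standard Gaussian is an assumption of this sketch).
    Z = rng.standard_normal(X_dup.shape)

    # Step 3: interpolate at each time level; t = 0 is pure noise, t = 1 is data.
    # The evenly spaced grid is an assumption of this sketch.
    t_levels = np.linspace(0.0, 1.0, n_t)
    training = {}
    for t in t_levels:
        X_t = t * X_dup + (1.0 - t) * Z
        # Time derivative of the interpolation: the vector field to regress on.
        target = X_dup - Z
        training[float(t)] = (X_t, target)
    return training


# Toy usage: 5 rows, 2 continuous features.
X = np.arange(10, dtype=float).reshape(5, 2)
data = build_flow_training_set(X, n_noise=3, n_t=4, seed=0)
```

Each key of `data` is one noise level, so the per-level regressions are independent of one another. That independence is what allows the training of the $n_t$ models to be parallelized over CPUs, as the caption describes.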

Theorems & Definitions (1)

  • Theorem 1: Theorem 3 of Lipman et al. (2022)