Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Alexia Jolicoeur-Martineau; Kilian Fatras; Tal Kachman

TL;DR

The paper tackles missing data and limited sample sizes in tabular datasets by introducing Forest-Diffusion and Forest-Flow, diffusion- and flow-based generative/imputation models that replace neural score/vector-field estimators with XGBoost. It builds an extensive benchmark across 27 real-world datasets and 9 metrics, showing that the proposed tree-based approach can outperform deep-learning generative baselines for data generation and remains competitive for imputation, while enabling CPU-only training. The key contributions are the first diffusion/flow methods for tabular data using Gradient-Boosted Trees, a scalable training recipe that duplicates data to estimate expectations, and comprehensive ablations demonstrating the practical trade-offs of noise-level modeling and conditioning. The work highlights the practical impact of using tree-based methods for mixed-type tabular data, enabling realistic synthetic data generation and robust imputations without specialized hardware, though it notes that traditional imputation methods may still excel in some settings and outlines avenues for future improvements.

Abstract

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.

Paper Structure (71 sections, 1 theorem, 8 equations, 47 figures, 10 tables, 7 algorithms)

INTRODUCTION
BACKGROUND
Gradient-boosted trees and XGBoost
Generative diffusion and conditional flow matching models
SDEs and score-based models
ODEs and conditional flow matching
TRAINING DIFFUSION & FLOW MODELS WITH XGBOOST
Duplicating the dataset to estimate the expectation in diffusion and flow losses
Training different models per noise level
Choice of Gradient-boosted Trees
Forward Diffusion/Flow process
Imputation via diffusion and XGBoost
Data processing
Gradient-Boosted Tree hyperparameters
Training one model per category
...and 56 more sections

Key Result

Theorem 1

The unique vector field whose integration map satisfies $\rho_t(x) = \nu_t + \sigma_t x$ has the form
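For orientation, the cited result (Theorem 3 of Lipman et al., 2022) can be written as follows. This is a sketch in that paper's own notation, with $\psi_t$ the flow map, $\mu_t$ and $\sigma_t$ the mean and standard deviation of the Gaussian conditional path, and primes denoting time derivatives. Reading $\rho_t$ and $\nu_t$ above as $\psi_t$ and $\mu_t$ is an assumption.

```latex
\psi_t(x) = \sigma_t(x_1)\, x + \mu_t(x_1)
\quad\Longrightarrow\quad
u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)} \bigl(x - \mu_t(x_1)\bigr) + \mu_t'(x_1)
```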

Figures (47)

Figure 1: Iris dataset: Three-way interaction between Petal length, width, and species using real or fake samples (using Forest-Flow, our XGBoost method, or TabDDPM and STaSy, deep-learning diffusion methods)
Figure 2: Illustration of our Forest-Flow method based on I-CFM (Tong et al., 2023); see the appendix for more details. The first step duplicates the original dataset. The second step adds different noise to each duplicated dataset. The third step computes the linear interpolation between each duplicated dataset and its corresponding noise at different times $t$ (i.e., $\mathcal{X}_i(t) = t\mathcal{X} + (1-t) \mathcal{Z}_i$, $\forall i \in [1, \ldots, n_{noise}]$ and $\forall t \in t_{\text{levels}}$). The final step is to regress a GBT model at each noise level against the vector field; the training of the $n_t$ models is parallelized over CPUs.
Figure 3: Our method learns a different GBT model (with 100 trees) for each noise level (here $n_t = 4$). We can reinterpret this as a single model with giant trees in which the splits on the time variable are hard-coded.
Figure 4: Average feature importance (SHAP value) across all XGBoost models trained with Forest-Flow on the Iris dataset. The features (from 0 to 4) are, respectively, sepal length, sepal width, petal length, petal width, and species. Instead of averaging, one could also produce a separate plot per noise level or per predicted variable.
...and 42 more figures
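The first three steps in the Figure 2 caption (duplicate, add noise, interpolate) can be sketched with NumPy alone. This is a minimal illustration, not the authors' implementation. The function name `build_flow_training_sets`, the default values, the evenly spaced time grid on [0, 1], and the use of standard Gaussian noise are all assumptions made for the example. The regression target `X - Z` is the time derivative of the interpolation $t\mathcal{X} + (1-t)\mathcal{Z}_i$.

```python
import numpy as np


def build_flow_training_sets(X, n_noise=50, n_t=4, seed=0):
    """Build one (inputs, targets) pair per time level (hypothetical helper).

    X is an (n, d) array of already-numeric features.
    """
    rng = np.random.default_rng(seed)

    # Step 1: duplicate the dataset n_noise times (stacked row-wise).
    X_dup = np.tile(X, (n_noise, 1))  # shape (n * n_noise, d)

    # Step 2: draw independent noise for every duplicated row
    # (standard Gaussian is an assumption of this sketch).
    Z = rng.standard_normal(X_dup.shape)

    # Step 3: interpolate between noise (t = 0) and data (t = 1).
    # The evenly spaced grid is an assumption of this sketch.
    t_levels = np.linspace(0.0, 1.0, n_t)
    training_sets = {}
    for t in t_levels:
        X_t = t * X_dup + (1.0 - t) * Z
        # Target vector field for this linear path: d/dt X_t = X - Z.
        target = X_dup - Z
        training_sets[float(t)] = (X_t, target)
    return training_sets


if __name__ == "__main__":
    X = np.arange(12, dtype=float).reshape(6, 2)
    sets = build_flow_training_sets(X, n_noise=3, n_t=4)
    for t, (X_t, target) in sets.items():
        print(f"t={t:.2f}: inputs {X_t.shape}, targets {target.shape}")
```

For the final step, each `(X_t, target)` pair would be passed to a gradient-boosted tree regressor such as `xgboost.XGBRegressor`, one fit per time level. Because the fits are independent, they can run in parallel across CPU cores.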

Theorems & Definitions (1)

Theorem 1: Theorem 3 of Lipman et al. (2022)
