Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
TL;DR
The paper tackles missing data and limited sample sizes in tabular datasets by introducing Forest-Diffusion and Forest-Flow, diffusion- and flow-based models for generation and imputation that replace neural score and vector-field estimators with XGBoost. It builds an extensive benchmark of 27 real-world datasets and 9 metrics, showing that the tree-based approach can outperform deep-learning baselines for data generation and remains competitive for imputation, while training on CPUs only. The key contributions are the first diffusion and flow methods for tabular data built on Gradient-Boosted Trees, a scalable training recipe that duplicates the data to estimate expectations, and comprehensive ablations of the practical trade-offs in noise-level modeling and conditioning. Tree-based models thus allow realistic synthetic data generation and robust imputation for mixed-type tabular data without specialized hardware, although the paper notes that traditional imputation methods may still excel in some settings and outlines avenues for future improvement.
Abstract
Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.
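To make the training recipe from the TL;DR more concrete, the following is a minimal sketch of how duplicating the data can stand in for the expectation over noise in conditional flow matching. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_flow_matching_training_set`, the parameters `n_t` and `k_dup`, the uniform noise-level grid, and the interpolation convention (noise at t = 0, data at t = 1) are all hypothetical choices made here. It covers continuous features only and stops before model fitting.

```python
import numpy as np


def build_flow_matching_training_set(X, n_t=5, k_dup=10, seed=0):
    """Build (input, target) pairs for a vector-field regressor.

    Assumptions (not taken from the paper): a uniform grid of n_t noise
    levels in [0, 1], and the linear path x_t = (1 - t) * z + t * x with
    regression target x - z.
    """
    rng = np.random.default_rng(seed)

    # Duplicate every row k_dup times so that each data point is paired
    # with k_dup independent noise draws. Trees are fit on a fixed
    # dataset rather than on mini-batches, so this duplication is what
    # approximates the expectation over noise.
    X_dup = np.tile(X, (k_dup, 1))
    Z = rng.standard_normal(X_dup.shape)

    t_grid = np.linspace(0.0, 1.0, n_t)
    inputs, targets = [], []
    for t in t_grid:
        inputs.append((1.0 - t) * Z + t * X_dup)  # noisy interpolant x_t
        targets.append(X_dup - Z)                 # conditional vector field
    return t_grid, X_dup, inputs, targets


# Toy usage on random continuous data (8 rows, 3 features).
X = np.random.default_rng(1).normal(size=(8, 3))
t_grid, X_dup, inputs, targets = build_flow_matching_training_set(X)
print(inputs[0].shape, targets[0].shape)
```

Each `(inputs[i], targets[i])` pair could then be handed to a gradient-boosted regressor such as `xgboost.XGBRegressor`, for example one model per noise level and per output feature. That per-level split is an assumption of this sketch.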
