Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching
Ange-Clément Akazan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas
TL;DR
This work tackles efficient and faithful tabular data generation under privacy constraints, addressing Forest Flow's (FF) limitations in speed and handling of categorical features. It introduces CS3F and HS3F, sequential-generation frameworks that learn per-feature velocity fields for continuous data and use multinomial sampling from XGBoost classifiers for categoricals, with HS3F applying Runge-Kutta 4th order ODE integration for accuracy. Across 25 real-world datasets, HS3F consistently improves distributional similarity, diversity, and downstream predictive performance, while delivering substantial speedups (notably $21$–$27\times$ faster on datasets with at least $20\%$ categorical features) compared to FF. The findings demonstrate robustness to affine changes in the ODE initial condition and suggest that sequential heterogeneous generation offers a promising direction for advancing tabular data synthesis and diffusion-inspired generative modeling.
Abstract
Privacy and regulatory constraints make data generation vital to advancing machine learning without relying on real-world datasets. A leading approach for tabular data generation is the Forest Flow (FF) method, which combines Flow Matching with XGBoost. Despite its good performance, FF is slow and introduces errors by treating categorical variables as one-hot continuous features. It is also highly sensitive to small changes in the initial conditions of the ordinary differential equation (ODE). To overcome these limitations, we develop Heterogeneous Sequential Feature Forest Flow (HS3F). Our method generates data sequentially (feature by feature), reducing the dependency on noisy initial conditions through the additional information from previously generated features. Furthermore, it generates categorical variables using multinomial sampling (from an XGBoost classifier) instead of flow matching, improving generation speed. We also use a Runge-Kutta 4th order (Rg4) ODE solver for improved performance over the Euler solver used in FF. Our experiments with 25 datasets reveal that HS3F produces higher quality and more diverse synthetic data than FF, especially for categorical variables. It also generates data 21 to 27 times faster for datasets with $\geq 20\%$ categorical variables. HS3F further demonstrates enhanced robustness to affine transformations of the flow ODE initial conditions compared to FF. This study not only validates HS3F but also unveils promising new strategies to advance generative models.
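To illustrate why the choice of ODE solver matters, the following minimal sketch compares one Euler integration with one classical Runge-Kutta 4th order integration on a toy scalar ODE. It is an illustration under stated assumptions, not the HS3F implementation: the velocity field `toy_velocity` is a hypothetical stand-in for a learned per-feature velocity field (in HS3F this would come from XGBoost regressors), and the function names `euler_step`, `rk4_step`, and `integrate` are introduced here for illustration only.

```python
import math


def euler_step(v, t, x, h):
    """One explicit Euler step: a single velocity evaluation per step."""
    return x + h * v(t, x)


def rk4_step(v, t, x, h):
    """One classical Runge-Kutta 4th order step: four velocity evaluations."""
    k1 = v(t, x)
    k2 = v(t + h / 2, x + h / 2 * k1)
    k3 = v(t + h / 2, x + h / 2 * k2)
    k4 = v(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, v, x0, n_steps):
    """Integrate dx/dt = v(t, x) from t = 0 to t = 1 with a fixed step size."""
    h = 1.0 / n_steps
    t, x = 0.0, x0
    for _ in range(n_steps):
        x = step(v, t, x, h)
        t += h
    return x


def toy_velocity(t, x):
    # Hypothetical stand-in for a learned velocity field (assumption).
    # With v(t, x) = x and x(0) = 1, the exact solution at t = 1 is e.
    return x


exact = math.e
x_euler = integrate(euler_step, toy_velocity, 1.0, n_steps=10)
x_rk4 = integrate(rk4_step, toy_velocity, 1.0, n_steps=10)

print(f"Euler error: {abs(x_euler - exact):.2e}")
print(f"RK4 error:   {abs(x_rk4 - exact):.2e}")
```

With the same number of steps, the fourth-order solver tracks the exact solution far more closely on this toy problem, at the cost of four velocity evaluations per step instead of one. Whether the same trade-off holds for tree-based velocity fields depends on the model and data.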
