Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching

Ange-Clément Akazan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas

TL;DR

This work tackles efficient and faithful tabular data generation under privacy constraints, addressing Forest Flow's (FF) limitations in speed and handling of categorical features. It introduces CS3F and HS3F, sequential-generation frameworks that learn per-feature velocity fields for continuous data and use multinomial sampling from XGBoost classifiers for categoricals, with HS3F applying Runge-Kutta 4th order ODE integration for accuracy. Across 25 real-world datasets, HS3F consistently improves distributional similarity, diversity, and downstream predictive performance, while delivering substantial speedups (notably $21$–$27\times$ faster on datasets with at least $20\%$ categorical features) compared to FF. The findings demonstrate robustness to affine changes in the ODE initial condition and suggest that sequential heterogeneous generation offers a promising direction for advancing tabular data synthesis and diffusion-inspired generative modeling.
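The categorical step described above can be illustrated with a minimal sketch. It assumes a hypothetical function `toy_class_probs` standing in for the class probabilities a trained XGBoost classifier would predict; the two-feature setup, the probabilities, and all names are illustrative and not taken from the paper.

```python
import random

def toy_class_probs(previous_features):
    """Hypothetical stand-in for a trained classifier's predicted class
    probabilities, conditioned on the features generated so far."""
    x = previous_features[0]
    p = 0.9 if x > 0 else 0.1
    return [1.0 - p, p]

def sample_categorical(previous_features, rng):
    """Draw one category index by multinomial sampling from the probabilities."""
    probs = toy_class_probs(previous_features)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
rows = []
for _ in range(2000):
    # First feature: continuous. Here it is plain Gaussian noise (an assumption
    # made for brevity; the paper uses a learned per-feature velocity field).
    x1 = rng.gauss(0.0, 1.0)
    # Second feature: categorical, sampled conditionally on the first feature.
    c2 = sample_categorical([x1], rng)
    rows.append((x1, c2))

# Empirical frequency of category 1 on each side of x1 = 0.
pos = [c for x, c in rows if x > 0]
neg = [c for x, c in rows if x <= 0]
frac_pos = sum(pos) / len(pos)
frac_neg = sum(neg) / len(neg)
print(frac_pos, frac_neg)
```

Because the category is drawn directly from predicted probabilities, no ODE has to be integrated for that feature, which is consistent with the speedup the summary reports for datasets with many categorical features.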

Abstract

Privacy and regulatory constraints make data generation vital to advancing machine learning without relying on real-world datasets. A leading approach for tabular data generation is the Forest Flow (FF) method, which combines Flow Matching with XGBoost. Despite its good performance, FF is slow and makes errors when treating categorical variables as one-hot continuous features. It is also highly sensitive to small changes in the initial conditions of the ordinary differential equation (ODE). To overcome these limitations, we develop Heterogeneous Sequential Feature Forest Flow (HS3F). Our method generates data sequentially (feature-by-feature), reducing the dependency on noisy initial conditions through the additional information from previously generated features. Furthermore, it generates categorical variables using multinomial sampling (from an XGBoost classifier) instead of flow matching, improving generation speed. We also use a Runge-Kutta 4th order (Rg4) ODE solver for improved performance over the Euler solver used in FF. Our experiments with 25 datasets reveal that HS3F produces higher-quality and more diverse synthetic data than FF, especially for categorical variables. It also generates data 21-27 times faster for datasets with $\geq 20\%$ categorical variables. HS3F further demonstrates enhanced robustness to affine transformations of the flow ODE initial conditions compared to FF. This study not only validates HS3F but also unveils promising new strategies to advance generative models.
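The difference between the Euler and Runge-Kutta 4th order solvers mentioned above can be seen on a generic ODE $dx/dt = v(x, t)$. The sketch below uses a toy velocity field $v(x, t) = x$ with a known exact solution; this field is an assumption chosen for checkability and is not the learned XGBoost velocity field from the paper.

```python
import math

def velocity(x, t):
    """Toy velocity field dx/dt = x (exact solution: x(t) = x0 * exp(t))."""
    return x

def euler_step(v, x, t, h):
    """One explicit Euler step: a single velocity evaluation."""
    return x + h * v(x, t)

def rk4_step(v, x, t, h):
    """One classical Runge-Kutta 4th order step: four velocity evaluations."""
    k1 = v(x, t)
    k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = v(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(step, v, x0, n_steps):
    """Integrate from t = 0 to t = 1 with a fixed step size."""
    h = 1.0 / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = step(v, x, t, h)
        t += h
    return x

exact = math.e  # x(1) for x0 = 1
euler_err = abs(integrate(euler_step, velocity, 1.0, 10) - exact)
rk4_err = abs(integrate(rk4_step, velocity, 1.0, 10) - exact)
print(euler_err, rk4_err)
```

With the same number of steps, the 4th order scheme gives a much smaller error on this toy problem, at the cost of four velocity evaluations per step instead of one.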

Paper Structure

This paper contains 33 sections, 17 equations, 13 figures, 5 tables, 4 algorithms.

Figures (13)

  • Figure 1: Iris data: three-way interaction plot of sepal width vs. length by species, comparing real data with data generated by ForestFlow (Jolicoeur-Martineau et al., 2024) and by HS3F using Euler and Runge-Kutta 4th order solvers
  • Figure 2: Data generation time comparison per model across generated datasets. We see that our proposed method HS3F is much faster than CS3F and Forest Flow, especially for datasets with many categorical features.
  • Figure 3: 2-Moons dataset (500 samples) and its generated versions from HS3F-Euler, HS3F-Rg4, and ForestFlow
  • Figure 4: HS3F-based Euler solver (orange and green rectangle, showing the generation process for both continuous and categorical data) and CS3F-based Euler solver (the area inside the green rounded rectangle, showing continuous data generation). $X=\{x^1,\dots,x^K\}$ is the original dataset and $\{z^1,\dots,z^K\}$ is a set of standard Gaussian feature vectors. The green arrows and steps indicate the generative process for continuous features, while the orange ones indicate that for categorical features. Steps 0 and 3 contain both colors, meaning that both methods use these steps, except that categorical feature generation does not use the set of standard Gaussian noise vectors.
  • Figure 5: Wasserstein train comparison across datasets
  • ...and 8 more figures