Flow Matching for Tabular Data Synthesis

Bahrul Ilmi Nasution, Floor Eijkelboom, Mark Elliot, Richard Allmendinger, Christian A. Naesseth

TL;DR

The paper addresses the privacy-utility trade-off in sharing tabular data by evaluating flow matching (FM) and variational FM (VFM) as efficient alternatives to diffusion models. It introduces two implementations, TabbyFlow (data-space VFM) and TabSynFlow (latent-space FM), and systematically analyzes learning targets, trajectory choices (OT vs VP), and deterministic vs stochastic dynamics. Across census and benchmark datasets, FM-based methods, especially TabbyFlow with OT, deliver higher data utility at low computational cost, while VP and SDE variants offer selective privacy benefits under dataset-specific conditions. The findings provide practical guidance for statistical agencies seeking high-utility, privacy-conscious tabular data synthesis and point to future work on formal privacy guarantees and adaptive trajectory selection.

Abstract

Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers (something possible when learning to generate using variational flow matching), characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably few function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial: the OT path demonstrates superior performance, while VP has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high-utility synthetic data with reduced disclosure risk.
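To make the OT versus VP distinction concrete, the following is a minimal sketch of the two conditional probability paths for continuous features. It is an illustration only, not the paper's implementation: the function names `ot_path` and `vp_path` are hypothetical, and the trigonometric VP schedule ($\alpha_t = \sin(\pi t/2)$, $\sigma_t = \cos(\pi t/2)$) is an assumption chosen because it satisfies $\alpha_t^2 + \sigma_t^2 = 1$; the schedule used in the paper may differ.

```python
import numpy as np


def ot_path(x0, x1, t):
    """Linear (OT) path from noise x0 (t=0) to data x1 (t=1).

    Returns the interpolated point x_t and the conditional velocity
    d x_t / dt, which is constant along the path.
    """
    xt = (1.0 - t) * x0 + t * x1
    ut = x1 - x0
    return xt, ut


def vp_path(x0, x1, t):
    """Variance-preserving path with an ASSUMED trigonometric schedule.

    alpha_t = sin(pi t / 2), sigma_t = cos(pi t / 2), so that
    alpha_t**2 + sigma_t**2 == 1 for every t in [0, 1].
    """
    alpha = np.sin(0.5 * np.pi * t)
    sigma = np.cos(0.5 * np.pi * t)
    xt = alpha * x1 + sigma * x0
    # Time derivative of x_t: alpha'(t) * x1 + sigma'(t) * x0.
    ut = 0.5 * np.pi * (np.cos(0.5 * np.pi * t) * x1
                        - np.sin(0.5 * np.pi * t) * x0)
    return xt, ut


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))  # noise samples (4 rows, 3 numeric columns)
x1 = rng.standard_normal((4, 3))  # stand-in for standardised data rows

xt_ot, ut_ot = ot_path(x0, x1, 0.5)
xt_vp, ut_vp = vp_path(x0, x1, 0.5)
```

In a flow matching training loop, a network would be regressed onto `ut` (or, in the variational setting, onto a posterior over `x1`) at randomly drawn `t`. Both paths share the same endpoints; they differ in how noise and data are mixed in between.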

Paper Structure

This paper contains 45 sections, 27 equations, 6 figures, 15 tables, 7 algorithms.

Figures (6)

  • Figure 1: Average utility ($\uparrow$) and disclosure risk ($\downarrow$) as a function of the number of function evaluations (NFEs) for TabSyn and flow matching models (TabSynFlow-OT, TabSynFlow-VP, TabbyFlow-OT, TabbyFlow-VP), averaged across four datasets. Solid lines represent utility, while dotted lines represent risk. Flow matching models converge after approximately 100 NFEs, achieving competitive utility at substantially lower computational cost compared to TabSyn.
  • Figure 2: Average utility ($\uparrow$) and risk ($\downarrow$) against ODE integration time ($t_{\text{ode}}$) for TabSynFlow and TabbyFlow using OT and VP paths. Colours and line styles indicate method and path (VP light and solid, OT dark and dashed). OT paths enable early stopping without significant utility loss, whereas VP paths require full integration to achieve competitive performance.
  • Figure 3: Utility and risk of TabSynFlow and TabbyFlow, per dataset and on average, for NFEs of $2^n$ with $n = 2, \dots, 10$. TabbyFlow-OT (and TabSynFlow-OT) achieve strong utility at low NFEs ($\leq 100$) and tend to converge by $\approx 128$ NFEs, whereas TabSyn requires more NFEs to catch up, so OT combined with flow matching is attractive under tight compute budgets.
  • Figure 4: Effect of ODE integration time on utility and risk (per dataset and on average) for TabSynFlow and TabbyFlow, with $t_{\text{ode}}$ ranging from 0.6 to 1 in steps of 0.1. Colours encode method and path (VP vs OT). OT achieves high-utility solutions early (e.g., $t_{\text{ode}} = 0.6$), while pushing $t_{\text{ode}} \to 1$ often reduces utility and/or increases risk, so early stopping is preferable.
  • Figure 5: Utility and risk of TabSynFlow and TabbyFlow, per dataset and on average, at late integration times $t_{\text{ode}} \in \{0.9, 0.95, 0.975, 1\}$. For selected datasets, we zoom into the late integration window using the same colour/marker scheme as Figure 4. Near full integration, instability emerges across several datasets (e.g., Fiji), consistent with utility drops and risk increases; OT remains comparatively robust but still degrades as $t_{\text{ode}} \to 1$.
  • ...and 1 more figure
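The captions above vary two sampling knobs: the number of function evaluations (NFEs) and the ODE integration time $t_{\text{ode}}$. The sketch below shows how the two interact in a plain fixed-step Euler sampler. It is a toy under stated assumptions, not the paper's solver: `euler_sample` and `toy_velocity` are hypothetical names, and the velocity field is the closed-form OT field for a target concentrated on a single point `mu`, used here in place of a trained network.

```python
import numpy as np


def euler_sample(velocity, x0, nfe, t_ode=1.0):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=t_ode.

    Uses `nfe` fixed-size Euler steps, i.e. one velocity evaluation
    per step. Setting t_ode < 1 stops the integration early.
    """
    x = np.array(x0, dtype=float)
    h = t_ode / nfe
    for k in range(nfe):
        t = k * h
        x = x + h * velocity(x, t)
    return x


# ASSUMED toy target: all data mass sits at the single point `mu`.
mu = np.array([1.0, -2.0])


def toy_velocity(x, t):
    """Closed-form OT velocity field towards the point mass at `mu`.

    Only evaluated at t < 1 by euler_sample, so the division is safe.
    """
    return (mu - x) / (1.0 - t)


# NFE grid matching the 2^n, n = 2..10 sweep described in Figure 3.
nfe_grid = [2 ** n for n in range(2, 11)]

x0 = np.zeros(2)  # starting point (noise sample fixed at the origin)
full = euler_sample(toy_velocity, x0, nfe=nfe_grid[0], t_ode=1.0)
early = euler_sample(toy_velocity, x0, nfe=nfe_grid[0], t_ode=0.6)
```

For this toy field the trajectory is a straight line, so full integration lands on `mu` and stopping at $t_{\text{ode}} = 0.6$ lands 60% of the way there. With a learned velocity field and real data the trade-off is empirical, which is what the utility and risk curves in the figures report.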