Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models
Zeyu Yang, Han Yu, Peikun Guo, Khadija Zanna, Xiaoxue Yang, Akane Sano
TL;DR
The paper addresses the bias that synthetic mixed-type tabular data inherits from its training data. It introduces a diffusion-model framework that conditions on both the target label and multiple sensitive attributes. The framework extends diffusion-based tabular synthesis with multivariate latent guidance and balanced sampling, so that the generated data has a fairer joint distribution over outcomes and sensitive attributes. It achieves improvements of more than 10% on demographic parity and equalized odds metrics while maintaining competitive fidelity (average AUC around 84.7%). The approach uses a U-Net with transformers as the posterior estimator in latent space and extends classifier-free guidance to multiple conditions, regulated by a security gate and momentum terms to prevent over-correction. Empirical results on Adult, Bank, and COMPAS show better fairness scores than the baselines. The paper also analyzes the balanced distributions of sensitive attributes and quantifies the trade-off between performance and fairness, with practical implications for fair data sharing and downstream decision-making.
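As a rough illustration of classifier-free guidance with several conditions, the sketch below adds one weighted correction per condition to the unconditional noise estimate. This additive form is a common composition rule and is an assumption here, not the paper's confirmed update. The function name `combine_guidance` and its parameters are hypothetical, and the gate and momentum terms mentioned above are omitted.

```python
import numpy as np

def combine_guidance(eps_uncond, eps_conds, weights):
    """Combine noise estimates under several conditions (illustrative only).

    eps_uncond: noise predicted with no condition.
    eps_conds:  noise predicted under each single condition
                (e.g. target label, sex, race), one array per condition.
    weights:    guidance strength for each condition (assumed hyperparameters).
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    guided = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        # Each condition pushes the estimate away from the unconditional one.
        guided += w * (np.asarray(eps_c, dtype=float) - eps_uncond)
    return guided
```

With all weights set to zero the function returns the unconditional estimate, and with a single condition it reduces to the usual classifier-free guidance correction term.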
Abstract
Diffusion models have emerged as a robust framework for various generative tasks, including tabular data synthesis. However, current tabular diffusion models tend to inherit the bias present in the training dataset and generate biased synthetic data, which may lead to discriminatory actions. In this research, we introduce a novel tabular diffusion model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes, such as sex and race. The empirical results demonstrate that our method effectively mitigates bias in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data on fairness metrics such as demographic parity ratio and equalized odds ratio, achieving improvements of over $10\%$. Our implementation is available at https://github.com/comp-well-org/fair-tab-diffusion.
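The two fairness metrics named above can be sketched as follows for binary labels. This is a minimal sketch using the common min/max-over-groups definitions, which are assumed here and may differ in detail from the paper's evaluation code. The helper names are hypothetical, and every group is assumed to contain both positive and negative true labels.

```python
import numpy as np

def _min_max_ratio(rates):
    # Ratio of the smallest to the largest group rate; 1.0 means parity.
    rates = np.asarray(rates, dtype=float)
    top = rates.max()
    # Convention chosen for this sketch: if every rate is 0, report parity.
    return 1.0 if top == 0 else float(rates.min() / top)

def demographic_parity_ratio(y_pred, sensitive):
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    # Selection rate = share of positive predictions within each group.
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return _min_max_ratio(rates)

def equalized_odds_ratio(y_true, y_pred, sensitive):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    tpr, fpr = [], []
    for g in np.unique(sensitive):
        in_group = sensitive == g
        # True positive rate and false positive rate within the group.
        tpr.append(y_pred[in_group & (y_true == 1)].mean())
        fpr.append(y_pred[in_group & (y_true == 0)].mean())
    # The worse of the two ratios is reported.
    return min(_min_max_ratio(tpr), _min_max_ratio(fpr))
```

Both ratios lie in [0, 1], and values closer to 1 indicate that a downstream classifier treats the sensitive groups more evenly.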
