Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models

Zeyu Yang; Han Yu; Peikun Guo; Khadija Zanna; Xiaoxue Yang; Akane Sano

TL;DR

The paper addresses the bias that tabular diffusion models inherit from their training data by introducing a diffusion framework for mixed-type tabular data that conditions on both the target label and multiple sensitive attributes. It extends diffusion-based tabular synthesis with multivariate latent guidance and balanced sampling so that the joint distribution of the target label and sensitive attributes in the generated data is balanced, achieving improvements of over 10% on demographic parity and equalized odds metrics while maintaining competitive fidelity (average AUC around 84.7%). The approach uses a U-Net with transformers as the posterior estimator in latent space and extends classifier-free guidance to multiple conditions, regulated by a security gate and momentum terms to prevent over-correction. Experiments on the Adult, Bank, and COMPAS datasets show better fairness scores than the baselines, together with an analysis of the balanced sensitive-attribute distributions and a quantified trade-off between performance and fairness, with practical implications for fair data sharing and downstream decision-making.
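To make the idea of classifier-free guidance with several conditions more concrete, the sketch below combines one noise prediction conditioned on the target label with additional predictions that are each also conditioned on one sensitive attribute. This is a generic additive form and an assumption, not necessarily the paper's exact update rule; the function name `guided_noise` and the per-attribute `weights` are hypothetical, and the security gate and momentum terms are left out.

```python
from typing import List, Sequence


def guided_noise(
    eps_outcome: Sequence[float],
    eps_sensitive: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine noise predictions in a classifier-free-guidance style.

    eps_outcome:   prediction conditioned only on the target label c.
    eps_sensitive: one prediction per sensitive attribute s^(i), each
                   conditioned on c and that attribute.
    weights:       hypothetical guidance scale per sensitive attribute.
    """
    if len(eps_sensitive) != len(weights):
        raise ValueError("need exactly one weight per sensitive attribute")

    # Start from the label-conditioned prediction.
    guided = list(eps_outcome)
    for eps_s, w in zip(eps_sensitive, weights):
        if len(eps_s) != len(eps_outcome):
            raise ValueError("all predictions must have the same length")
        # Move toward the direction implied by this sensitive attribute.
        for j, (e_s, e_c) in enumerate(zip(eps_s, eps_outcome)):
            guided[j] += w * (e_s - e_c)
    return guided


if __name__ == "__main__":
    print(guided_noise([1.0, 2.0], [[2.0, 2.0], [1.0, 4.0]], [0.5, 1.0]))
```

With all weights set to zero the function returns the label-conditioned prediction unchanged, so under this assumed form the weights control how strongly each sensitive attribute steers the sample.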

Abstract

Diffusion models have emerged as a robust framework for various generative tasks, including tabular data synthesis. However, current tabular diffusion models tend to inherit bias in the training dataset and generate biased synthetic data, which may influence discriminatory actions. In this research, we introduce a novel tabular diffusion model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes, such as sex and race. The empirical results demonstrate that our method effectively mitigates bias in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data on fairness metrics such as demographic parity ratio and equalized odds ratio, achieving improvements of over $10\%$. Our implementation is available at https://github.com/comp-well-org/fair-tab-diffusion.
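For readers unfamiliar with the two fairness metrics named in the abstract, the following sketch computes them for binary predictions. It assumes the widely used definitions: the demographic parity ratio is the smallest group selection rate divided by the largest, and the equalized odds ratio is the smaller of the corresponding ratios for true-positive and false-positive rates. Whether the paper's evaluation code uses exactly these definitions is an assumption; the helper names and the convention of returning 1.0 when every rate is zero are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def _rate(preds: List[int]) -> float:
    # Fraction of positive predictions; an empty group counts as 0.0 (assumed convention).
    return sum(preds) / len(preds) if preds else 0.0


def _min_max_ratio(rates: List[float]) -> float:
    # Ratio of smallest to largest rate; 1.0 if all rates are zero (assumed convention).
    highest = max(rates)
    return min(rates) / highest if highest > 0 else 1.0


def demographic_parity_ratio(y_pred: Sequence[int], groups: Sequence[Hashable]) -> float:
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for pred, group in zip(y_pred, groups):
        by_group[group].append(pred)
    return _min_max_ratio([_rate(p) for p in by_group.values()])


def equalized_odds_ratio(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> float:
    pos: Dict[Hashable, List[int]] = defaultdict(list)  # predictions where y_true == 1
    neg: Dict[Hashable, List[int]] = defaultdict(list)  # predictions where y_true == 0
    for truth, pred, group in zip(y_true, y_pred, groups):
        (pos if truth == 1 else neg)[group].append(pred)
    all_groups = set(groups)
    tpr_ratio = _min_max_ratio([_rate(pos[g]) for g in all_groups])
    fpr_ratio = _min_max_ratio([_rate(neg[g]) for g in all_groups])
    return min(tpr_ratio, fpr_ratio)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(demographic_parity_ratio(y_pred, groups))
    print(equalized_odds_ratio(y_true, y_pred, groups))
```

Both ratios lie in [0, 1], and a value of 1 means the groups are treated identically under the respective criterion.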

Paper Structure

This paper contains 41 sections, 26 equations, 7 figures, 9 tables.

Introduction
Related Work
Bias in Machine Learning
Fair Data Synthesis
Fair/Safe Diffusion Models
Diffusion Models
Gaussian Diffusion Kernel
Multinomial Diffusion Kernel
Model Fitting
Classifier-Free Guidance
Methods
Multivariate Latent Guidance
Backbone
Balanced Sampling
Experiments
...and 26 more sections

Figures (7)

Figure 1: The diagram of our model architecture. In the forward process, the input data point $\mathbf{x}$ is pre-processed into a numerical part $\mathbf{x}_{\text{num}}$ and a categorical part $\mathbf{x}_{\text{cat}}$, and then passed through $T$ steps of the diffusion kernel to get $\mathbf{x}_{1}, \cdots, \mathbf{x}_{T}$. $\mathbf{z}_{1}, \cdots, \mathbf{z}_{T}$ are the latent representations of $\mathbf{x}_{1}, \cdots, \mathbf{x}_{T}$. In the reverse process, the posterior estimator iteratively denoises the noisy input $\mathbf{z}_{T}$, conditioning on an outcome $\mathbf{c}$ and $N$ sensitive attributes $\mathbf{s}^{(1)}, \cdots, \mathbf{s}^{(N)}$. The estimated data point is $\hat{\mathbf{x}}$. Our model can generate fair synthetic data by leveraging sensitive guidance to ensure a balanced joint distribution of the target label and sensitive attributes.
Figure 2: The distribution of sensitive attributes across different target label values on the Adult dataset using our method.
Figure 3: Comparison of the real versus synthetic distributions of sensitive attributes across all datasets.
Figure 4: Trade-off between performance and fairness across balancing levels for the experimental datasets. SCORE is computed as a weighted sum of AUC, DPR, and EOR, with weights of 0.5 for AUC, 0.25 for DPR, and 0.25 for EOR. The best composite scores for the Adult and Bank datasets are achieved at a balancing level of 10, while the COMPAS dataset achieves its best score at level 9.
Figure 5: Comparison of models on the Adult dataset.
...and 2 more figures
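The composite SCORE described in the Figure 4 caption can be sketched as a simple weighted sum, followed by picking the balancing level with the highest score. The weights (0.5, 0.25, 0.25) come from the caption; the function names, the assumption that all three metrics are on a 0 to 1 scale, and the numbers in the usage example are hypothetical and are not results from the paper.

```python
from typing import Dict, Tuple


def composite_score(
    auc: float,
    dpr: float,
    eor: float,
    weights: Tuple[float, float, float] = (0.5, 0.25, 0.25),
) -> float:
    """Weighted sum of AUC, DPR and EOR (metrics assumed to lie in [0, 1])."""
    w_auc, w_dpr, w_eor = weights
    return w_auc * auc + w_dpr * dpr + w_eor * eor


def best_level(results: Dict[int, Tuple[float, float, float]]) -> int:
    """Return the balancing level whose (AUC, DPR, EOR) triple scores highest."""
    return max(results, key=lambda level: composite_score(*results[level]))


if __name__ == "__main__":
    # Made-up metric triples per balancing level, for illustration only.
    example = {1: (0.90, 0.20, 0.20), 2: (0.85, 0.80, 0.70)}
    print(best_level(example))
```

Because AUC carries half of the total weight, a small loss in AUC can be offset by larger gains in the two fairness ratios, which is the trade-off the figure visualizes.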
