Data Augmentation via Diffusion Model to Enhance AI Fairness

Christina Hastings Blow, Lijun Qian, Camille Gibson, Pamela Obiomon, Xishuang Dong

TL;DR

Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification, and five traditional machine learning models were used to validate the proposed approach.

Abstract

AI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which generates synthetic data from existing datasets, has gained significant attention as a remedy for data scarcity. Diffusion models in particular have become a powerful technique for generating synthetic data, especially in fields such as computer vision. This paper explores the potential of diffusion models to generate synthetic tabular data that improves AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model that adapts to any tabular dataset and handles various feature types, was used to augment the training data with different amounts of generated samples. In addition, sample reweighting from AIF360 was applied to further enhance fairness. Five traditional machine learning models were used to validate the proposed approach: Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF). Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
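The sample reweighting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard reweighing rule, in which each sample is weighted by the ratio of the expected to the observed frequency of its (protected group, label) cell; this is the rule behind the Reweighing preprocessor in AIF360. The function name `reweighing_weights` and the toy data are hypothetical, and this is not the authors' code.

```python
import numpy as np

def reweighing_weights(protected, labels):
    """Per-sample weights w(s, y) = P(s) * P(y) / P(s, y).

    Hypothetical helper: `protected` holds group membership and
    `labels` holds binary class labels, one entry per sample.
    """
    protected = np.asarray(protected)
    labels = np.asarray(labels)
    n = len(labels)
    weights = np.empty(n, dtype=float)
    for s in np.unique(protected):
        for y in np.unique(labels):
            cell = (protected == s) & (labels == y)
            observed = cell.sum()
            if observed == 0:
                continue  # empty cell: no samples to weight
            # Count expected in this cell if group and label were independent.
            expected = (protected == s).sum() * (labels == y).sum() / n
            weights[cell] = expected / observed
    return weights

# Toy data (assumed for illustration): group 0 has fewer positive labels.
protected = [0, 0, 0, 1, 1, 1, 1, 1]
labels = [1, 0, 0, 1, 1, 1, 0, 0]
w = reweighing_weights(protected, labels)
```

After reweighting, the weighted positive rate is the same in both groups, which is the property the scheme is designed to produce. The resulting array can typically be passed to a classifier through a `sample_weight` argument where the estimator supports one.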

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Tab-DDPM framework.
  • Figure 2: Flow of the proposed method.
  • Figure 3: Attribute distribution comparison between original data and synthetic data for five attributes: sex, race, education, work class, and occupation. The attribute value "?" denotes missing data. Three synthetic-data cases are shown: 20,000, 100,000, and 150,000 synthetic samples.
  • Figure 4: Performance comparison via BA vs. AOD before and after reweighting samples on the Adult Income dataset with respect to the protected attribute $Race$ for LR, in the case of 150,000 synthetic samples.
  • Figure 5: Performance comparison via BA vs. AOD before and after reweighting samples on the Adult Income dataset with respect to the protected attribute $Race$ for RF, in the case of 100,000 synthetic samples.
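
The two axes in Figures 4 and 5 can be made concrete with a short sketch. It assumes that BA denotes balanced accuracy and AOD denotes average odds difference, the usual meanings of these abbreviations in AIF360; the captions do not expand them. The function names and toy arrays are hypothetical, and the code is not taken from the paper.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """True-positive and false-positive rates for binary labels in {0, 1}.

    Assumes both classes are present; otherwise a rate is undefined.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tpr, fpr = tpr_fpr(y_true, y_pred)
    return 0.5 * (tpr + (1.0 - fpr))

def average_odds_difference(y_true, y_pred, protected, privileged=1):
    """Mean of the FPR gap and TPR gap, unprivileged minus privileged."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    priv = np.asarray(protected) == privileged
    tpr_p, fpr_p = tpr_fpr(y_true[priv], y_pred[priv])
    tpr_u, fpr_u = tpr_fpr(y_true[~priv], y_pred[~priv])
    return 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))

# Toy predictions (assumed for illustration); group 1 is treated as privileged.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
protected = [0, 0, 0, 0, 1, 1, 1, 1]
ba = balanced_accuracy(y_true, y_pred)
aod = average_odds_difference(y_true, y_pred, protected)
```

Under these definitions an AOD of 0 means both groups have the same error rates, so a fairer model sits closer to 0 on that axis while keeping BA high.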