Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models

Yinan Cheng; Chi-Hua Wang; Vamsi K. Potluru; Tucker Balch; Guang Cheng

TL;DR

This paper tackles the problem of selecting downstream task-oriented generative models for synthetic data in fraud detection, comparing neural-network-based and Bayesian-network-based generators under interpretability constraints. It introduces an evaluation framework spanning data, model classes, and performance metrics, and conducts extensive experiments on a highly imbalanced credit card fraud dataset. Key findings show that BN-based generators often outperform NN-based ones under strict interpretability, with metric-dependent utility across accuracy, AUROC, recall, precision, and F1; CTGAN and DataSynthesizer variants emerge as strong options depending on the target metric. The work provides practical guidance for practitioners aiming to replace real training data with synthetic data and points to future directions in generative model auditing to improve trust and reliability in data-centric ML workflows.
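To make the phrase "metric-dependent utility" concrete, here is a minimal sketch of how four of the metrics named above are computed for a binary fraud classifier. It is an illustration only, not the paper's evaluation code: the function name `utility_metrics` and the toy labels are assumptions introduced here, and AUROC is left out because it needs predicted scores rather than hard labels.

```python
def utility_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against empty denominators, which are common on imbalanced data.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy, made-up labels (NOT the paper's data): 2 fraud cases in 100 transactions.
y_true = [1, 1] + [0] * 98

# Hypothetical classifier A never flags fraud.
always_legit = [0] * 100
# Hypothetical classifier B catches one fraud, misses one, and raises one false alarm.
catches_one = [1, 0, 1] + [0] * 97

m_a = utility_metrics(y_true, always_legit)
m_b = utility_metrics(y_true, catches_one)
```

On these toy labels both classifiers reach the same accuracy (98 of 100 correct), yet only the second has non-zero recall, precision, and F1. This is one reason why, on a highly imbalanced dataset, the preferred generator can differ depending on which metric the practitioner targets.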

Abstract

Devising procedures for downstream task-oriented generative model selection is an unresolved problem of practical importance. Existing studies have focused on the utility of a single family of generative models. They provide limited insight into how synthetic data practitioners should select the best family of generative models for synthetic training tasks, given a specific combination of machine learning model class and performance metric. In this paper, we approach the downstream task-oriented generative model selection problem in the case of training fraud detection models and investigate the best practice under different combinations of model interpretability and model performance constraints. Our investigation supports that, while Neural Network (NN)-based and Bayesian Network (BN)-based generative models both perform well on the synthetic training task under a loose model interpretability constraint, BN-based generative models are better than NN-based ones when training fraud detection models on synthetic data under a strict model interpretability constraint. Our results provide practical guidance for machine learning practitioners who are interested in replacing their real training dataset with a synthetic one, and shed light on more general downstream task-oriented generative model selection problems.

Paper Structure (18 sections, 11 figures, 1 table)

This paper contains 18 sections, 11 figures, 1 table.

Introduction
Contributions
Paper Organization
Related Work
Synthetic Data Generative Models
Fraud Detection Model Interpretability
Fraud Detection Metric Utility
Experiment Setup
Training Data Synthesis
Choice of Fraud Detection Model Class
Choice of Fraud Detection Utility Metrics
Evaluation
Comparison of Original to Synthetic Data
Results on Utility-oriented GMS
Results on Interpretability-oriented GMS
...and 3 more sections

Figures (11)

Figure 1: Results to solve Utility-oriented GMS: Utility Metrics for Fraud Detection Classifiers
Figure 2: Results to solve Interpretability-oriented GMS: Precision-Recall curve and Average Precision
Figure 3: Accuracy. A CTGAN-augmented training dataset damages the utility of the synthetic-trained classifier. A PrivBayes-augmented training dataset improves it.
Figure 4: Recall. A CTGAN-augmented training dataset improves the utility of the synthetic-trained classifier. A PrivBayes-augmented training dataset damages it.
Figure 5: AUROC. A CTGAN-augmented training dataset improves the utility of the synthetic-trained classifier. A PrivBayes-augmented training dataset damages it.
...and 6 more figures
