Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models
Yinan Cheng, Chi-Hua Wang, Vamsi K. Potluru, Tucker Balch, Guang Cheng
TL;DR
This paper tackles the problem of selecting downstream task-oriented generative models for synthetic data in fraud detection, comparing neural-network-based and Bayesian-network-based generators under interpretability constraints. It introduces an evaluation framework spanning data, model classes, and performance metrics, and conducts extensive experiments on a highly imbalanced credit card fraud dataset. Key findings show that BN-based generators often outperform NN-based ones under strict interpretability, with metric-dependent utility across accuracy, AUROC, recall, precision, and F1; CTGAN and DataSynthesizer variants emerge as strong options depending on the target metric. The work provides practical guidance for practitioners aiming to replace real training data with synthetic data and points to future directions in generative model auditing to improve trust and reliability in data-centric ML workflows.
Abstract
Devising procedures for downstream task-oriented generative model selection is an unresolved problem of practical importance. Existing studies have focused on the utility of a single family of generative models. They provide limited insight into how synthetic data practitioners should select the best family of generative models for a synthetic training task, given a specific combination of machine learning model class and performance metric. In this paper, we approach the downstream task-oriented generative model selection problem in the case of training fraud detection models and investigate the best practice under different combinations of model interpretability and model performance constraints. Our investigation supports that, while Neural Network (NN)-based and Bayesian Network (BN)-based generative models both perform well on the synthetic training task under a loose model interpretability constraint, BN-based generative models outperform NN-based ones when training fraud detection models on synthetic data under a strict model interpretability constraint. Our results provide practical guidance for machine learning practitioners who are interested in replacing their real training dataset with a synthetic one, and shed light on more general downstream task-oriented generative model selection problems.
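The selection problem described in the abstract can be pictured as a lookup over evaluation scores: for a fixed model class and performance metric, pick the generator whose synthetic training data yields the best score on real held-out data. The following is a minimal sketch under stated assumptions, not the authors' procedure. The function name `select_generator`, the generator labels `gen_a` and `gen_b`, and all score values are hypothetical placeholders introduced for illustration.

```python
from typing import Dict, Tuple

# Key: (generator, model_class, metric) -> score measured on real held-out data
# after training the model on that generator's synthetic data.
Scores = Dict[Tuple[str, str, str], float]


def select_generator(scores: Scores, model_class: str, metric: str) -> str:
    """Return the generator with the highest score for one (model class, metric) pair."""
    candidates = {
        gen: score
        for (gen, mc, m), score in scores.items()
        if mc == model_class and m == metric
    }
    if not candidates:
        raise KeyError(f"no scores recorded for ({model_class!r}, {metric!r})")
    return max(candidates, key=candidates.get)


# Placeholder numbers for illustration only; they are NOT results from the paper.
example_scores: Scores = {
    ("gen_a", "logistic_regression", "recall"): 0.60,
    ("gen_b", "logistic_regression", "recall"): 0.75,
    ("gen_a", "logistic_regression", "precision"): 0.80,
    ("gen_b", "logistic_regression", "precision"): 0.70,
}
```

With these placeholder scores, the selected generator changes with the target metric, which mirrors the metric-dependent utility the paper reports: `select_generator(example_scores, "logistic_regression", "recall")` picks `gen_b`, while the same call with `"precision"` picks `gen_a`.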
