Improve Fidelity and Utility of Synthetic Credit Card Transaction Time Series from Data-centric Perspective

Din-Yin Hsieh, Chi-Hua Wang, Guang Cheng

TL;DR

This paper tackles the problem of producing high-fidelity and practically useful synthetic credit card transaction time series under privacy constraints by adopting a data-centric preprocessing strategy. It introduces five preprocessing schemas applied to CPAR within the SDV framework to improve both fidelity and downstream fraud-detection utility, with Schema 5 yielding the closest match to the original data distributions, particularly for the Amount variable. The authors evaluate downstream fraud-detection models (CatBoost, XGBoost, LGBM) trained on synthetic data, finding CatBoost and LGBM to outperform XGBoost and demonstrating robustness to encoding schemes. The work provides actionable guidance for practitioners in finance on when and how to preprocess data to maximize the utility of synthetic datasets for time-series fraud detection and broader synthetic-data methodologies.

Abstract

Exploring generative model training for synthetic tabular data, specifically in sequential contexts such as credit card transaction data, presents significant challenges. This paper addresses these challenges, focusing on attaining both high fidelity to actual data and optimal utility for machine learning tasks. We introduce five pre-processing schemas to enhance the training of the Conditional Probabilistic Auto-Regressive Model (CPAR), demonstrating incremental improvements in the synthetic data's fidelity and utility. Upon achieving satisfactory fidelity levels, our attention shifts to training fraud detection models tailored for time-series data, evaluating the utility of the synthetic data. Our findings offer valuable insights and practical guidelines for synthetic data practitioners in the finance sector, transitioning from real to synthetic datasets for training purposes, and illuminating broader methodologies for synthesizing credit card transaction time series.

Paper Structure (16 sections, 2 equations, 11 figures, 1 algorithm)

Figures (11)

  • Figure 1: Metadata Data Type of Original Credit Card Transaction Dataset. $S_{j}^{(i)}$ denotes the $j$th row of the $i$th user.
  • Figure 2: Differences of Columns between Schemas
  • Figure 3: Marginal Distributions of the columns Is Fraud?, User, Use Chip, and Errors? for the original dataset and Schemas 1 to 5
  • Figure 4: Marginal Distribution of Location (first 50 entries) for Synthetic dataset, Schema 5
  • Figure 5: Amount Distribution for Original Dataset vs Synthetic Datasets Generated From Schema 3, 4, and 5
  • ...and 6 more figures