Impact of Sampling Techniques and Data Leakage on XGBoost Performance in Credit Card Fraud Detection
Siyaxolisa Kabane
TL;DR
The paper tackles the problem of fraud detection under extreme class imbalance by evaluating how the timing of sampling techniques affects XGBoost performance. It compares three scenarios (no imbalance handling, pre-split sampling, and post-split sampling) on a large Kaggle credit card dataset, highlighting the risk of data leakage when sampling occurs before the train-test split. The findings show that pre-split sampling can artificially inflate performance, while post-split sampling yields reliable, leakage-free metrics across baseline and CGAN/SMOTE configurations. The study advocates post-split, carefully implemented sampling as best practice and calls for rigorous evaluation protocols to ensure realistic deployment of fraud-detection models.
Abstract
Credit card fraud detection remains a critical challenge in financial security, and machine learning models such as XGBoost (eXtreme Gradient Boosting) have emerged as powerful tools for identifying fraudulent transactions. However, the inherent class imbalance in credit card transaction datasets poses significant challenges for model performance. Although sampling techniques are commonly used to address this imbalance, they are sometimes applied before the train-test split, potentially introducing data leakage. This study presents a comparative analysis of XGBoost's performance in credit card fraud detection under three scenarios: first, without any imbalance handling; second, with sampling applied only to the training set after the train-test split; and third, with sampling applied before the train-test split. We used a Kaggle dataset of 284,807 credit card transactions, of which 0.172% are fraudulent, to evaluate these approaches. Our findings show that although sampling strategies enhance model performance, the reliability of the results depends heavily on when they are applied. Because sampling before the split lets information from records that end up in the test set leak into training, XGBoost models trained this way may have displayed artificially inflated performance metrics. Models trained with sampling applied solely to the training set scored significantly lower than those with pre-split sampling, while preserving the integrity of the evaluation process.
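The difference between the pre-split and post-split scenarios can be illustrated with a minimal sketch. It is not the study's pipeline: it uses synthetic rows rather than the Kaggle data, and it swaps in plain random oversampling (duplicating minority rows) for SMOTE or CGAN so that leakage can be counted exactly. All function names (`make_dataset`, `random_oversample`, `split`, `leaked_test_rows`) and parameter values are hypothetical. With SMOTE the leaked rows would be interpolated points rather than exact copies, but the mechanism is analogous.

```python
import random


def make_dataset(n_rows=10_000, fraud_rate=0.01, seed=0):
    """Build synthetic (row_id, amount, label) rows; label 1 marks fraud."""
    rng = random.Random(seed)
    return [
        (row_id, rng.random(), 1 if rng.random() < fraud_rate else 0)
        for row_id in range(n_rows)
    ]


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen fraud rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[2] == 1]
    majority = [r for r in rows if r[2] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaked_test_rows(train, test):
    """Count test rows whose original record also appears in the training set."""
    train_ids = {row_id for row_id, _, _ in train}
    return sum(1 for row_id, _, _ in test if row_id in train_ids)


data = make_dataset()

# Pre-split sampling: oversample the full dataset, then split.
pre_train, pre_test = split(random_oversample(data))
pre_leaked = leaked_test_rows(pre_train, pre_test)

# Post-split sampling: split first, then oversample the training part only.
post_train, post_test = split(data)
post_train = random_oversample(post_train)
post_leaked = leaked_test_rows(post_train, post_test)

print("test rows also seen in training (pre-split): ", pre_leaked)
print("test rows also seen in training (post-split):", post_leaked)
```

In the pre-split case, copies of the same fraud record land on both sides of the split, so a model can score well on the test set simply by recognizing records it was trained on. In the post-split case the test set contains only untouched original rows, so no test record appears in training.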
