Impact of Sampling Techniques and Data Leakage on XGBoost Performance in Credit Card Fraud Detection

Siyaxolisa Kabane

TL;DR

The paper tackles the problem of fraud detection under extreme class imbalance by evaluating how the timing of sampling techniques affects XGBoost performance. It compares three scenarios (no imbalance handling, pre-split sampling, and post-split sampling) on a large Kaggle credit card dataset, highlighting the risk of data leakage when sampling occurs before the train-test split. The findings show that pre-split sampling can artificially inflate performance, while post-split sampling yields reliable, leakage-free metrics across baseline and CGAN/SMOTE configurations. The study advocates post-split, carefully implemented sampling as best practice and calls for rigorous evaluation protocols to ensure realistic deployment of fraud-detection models.

Abstract

Credit card fraud detection remains a critical challenge in financial security, with machine learning models like XGBoost (eXtreme Gradient Boosting) emerging as powerful tools for identifying fraudulent transactions. However, the inherent class imbalance in credit card transaction datasets poses significant challenges for model performance. Although sampling techniques are commonly used to address this imbalance, they are sometimes applied before the train-test split, potentially introducing data leakage. This study presents a comparative analysis of XGBoost's performance in credit card fraud detection under three scenarios: first, without any imbalance handling techniques; second, with sampling techniques applied only to the training set after the train-test split; and third, with sampling techniques applied before the train-test split. We used a Kaggle dataset of 284,807 credit card transactions, of which 0.172% are fraudulent, to evaluate these approaches. Our findings show that although sampling strategies enhance model performance, the reliability of the results depends heavily on when they are applied. Because of data leakage, a common pitfall at the sampling stage, XGBoost models trained on data that was sampled before the train-test split may have displayed artificially inflated performance metrics. Surprisingly, models trained with sampling techniques applied solely to the training set scored significantly lower than those with pre-split sampling, while preserving the integrity of the evaluation process.
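
The difference between the second and third scenarios can be illustrated with a minimal sketch. It is not the paper's pipeline: it uses plain random oversampling (duplicating fraud rows) as a stand-in for SMOTE or CGAN, synthetic row ids instead of the Kaggle data, and no classifier at all. All function names are hypothetical. It only measures how many test rows share an original record with the training set under each ordering.

```python
import random


def make_dataset(n_legit=1000, n_fraud=10):
    """Synthetic stand-in data: (row_id, label) pairs, label 1 = fraud."""
    rows = [(i, 0) for i in range(n_legit)]
    rows += [(n_legit + i, 1) for i in range(n_fraud)]
    return rows


def random_oversample(rows, rng):
    """Duplicate minority rows at random until both classes are the same size.

    Assumption: simple duplication stands in for SMOTE/CGAN here.
    """
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra


def split(rows, rng, test_fraction=0.2):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaked_fraction(train, test):
    """Share of test rows whose original row_id also appears in training."""
    train_ids = {row_id for row_id, _ in train}
    return sum(row_id in train_ids for row_id, _ in test) / len(test)


rng = random.Random(42)
data = make_dataset()

# Scenario 3 (pre-split sampling): oversample first, then split.
# Copies of the same fraud record can land on both sides of the split.
pre_train, pre_test = split(random_oversample(data, rng), rng)

# Scenario 2 (post-split sampling): split first, oversample the training set only.
train, post_test = split(data, rng)
post_train = random_oversample(train, rng)

pre_leak = leaked_fraction(pre_train, pre_test)
post_leak = leaked_fraction(post_train, post_test)
print(f"pre-split sampling, test rows also seen in training:  {pre_leak:.2%}")
print(f"post-split sampling, test rows also seen in training: {post_leak:.2%}")
```

With pre-split sampling, a share of the test rows are copies of records the model has already trained on, which is one way evaluation metrics can end up inflated. With post-split sampling the test set stays untouched, so the overlap is zero by construction.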

Paper Structure

This paper contains 27 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Class distribution of the credit card transactions dataset, showing a stark imbalance between legitimate (Class 0) and fraudulent (Class 1) transactions.
  • Figure 2: Boxplot of transaction amounts by class type, showing the distribution and range of amounts for legitimate (Class 0) and fraudulent (Class 1) transactions. Legitimate transactions exhibit a wider range, including high-value outliers, whereas fraudulent transactions tend to be lower in value.
  • Figure 3: Correlation heatmap of features in the dataset, showing relationships between variables.
  • Figure 4: Distribution of selected features in the dataset, comparing legitimate and fraudulent transactions. Features like V1, V2, and V3 show distinct patterns between classes, which could aid in fraud detection.