EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection

En-Ya Kuo, Sebastien Motsch

Abstract

Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT significantly improves downstream classification performance compared to existing oversampling and generative methods, while maintaining comparable privacy protection and preserving feature correlations present in the original data.

Paper Structure

This paper contains 17 sections, 11 equations, 8 figures, 3 tables, 3 algorithms.

Figures (8)

  • Figure 1: Overview of the proposed EmDT architecture. Starting from an imbalanced transaction dataset, the minority samples are first projected into a two-dimensional UMAP space, where distinct fraud clusters are identified (Step 1). A separate diffusion-based generative model is then trained for each cluster (Step 2), using sinusoidal embeddings and a Transformer architecture to generate synthetic fraud samples from samples drawn from a normal (Gaussian) distribution. The synthetic data are combined with the original dataset to train a tree-based classifier such as XGBoost (Step 3), further improving the detection of fraudulent transactions.
  • Figure 2: Illustration of forward and reverse processes in the diffusion model. The forward process $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ progressively corrupts the original data distribution $q_0$ (left) by adding Gaussian noise to each sample ${\bf x}_0$ over timesteps, eventually transforming the data into pure Gaussian noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (right). The reverse process $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t} \right)$ learns to reverse this noising process by iteratively denoising samples, recovering the data distribution from pure noise.
  • Figure 3: An overview of the proposed EmDT model. In the forward process, Gaussian noise is gradually added to the fraud training samples. During the reverse process, EmDT embeds the noisy inputs into higher-dimensional spaces and applies a Transformer to better capture feature relationships. The Transformer is followed by a linear projection, and the EmDT model learns to denoise the data and generate synthetic fraud samples.
  • Figure 4: Left: UMAP visualization of the Credit Card dataset ($N = 284{,}807$ samples, $d=29$ features). Fraudulent transactions (minority class, $n = 492$, $0.17\%$) are shown in red, while legitimate transactions (majority class, ${\approx}99.83\%$) are shown in blue. The substantial overlap between classes highlights the difficulty of the classification task. Right: UMAP projection restricted to fraudulent transactions only, revealing a structure of 3 distinct clusters.
  • Figure 5: Overview of the performance evaluation workflow for generative models. The procedure is divided into four steps. First, generative models based on EmDT are trained. Second, a tree-based classifier (XGBoost) is optimized using the training set and the synthetic data. Third, the hyperparameters of the EmDT and classifier are optimized based on the F1 score on the validation set. Fourth, the classifier is evaluated on the test set.
  • ...and 3 more figures
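
The forward process described in the Figure 2 caption can be illustrated with a short NumPy sketch. It assumes the standard DDPM parameterization, $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$, which may differ in detail from the paper's own equations. The linear noise schedule, the value $T = 1000$, the function names, and the placeholder data are all assumptions made for illustration; only the array shape ($492 \times 29$) is taken from the Figure 4 caption.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (an assumption, not from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_step(x_prev, beta_t, rng):
    """One forward step: sample x_t from q(x_t | x_{t-1})."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


def forward_closed_form(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 using the cumulative product alpha_bar[t]."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)

# Placeholder data with the shape of the fraud subset (492 samples, 29 features).
# These are random numbers, not the real Credit Card dataset.
x0 = 3.0 * rng.standard_normal((492, 29)) + 5.0

# A single noising step barely changes the data.
x1 = forward_step(x0, betas[0], rng)

# After all T steps the samples are close to standard Gaussian noise.
x_T, _ = forward_closed_form(x0, T - 1, alpha_bar, rng)
print("alpha_bar at final step:", alpha_bar[-1])
print("mean and std of x_T:", x_T.mean(), x_T.std())
```

With this schedule the final cumulative product `alpha_bar[-1]` is close to zero, so `x_T` retains almost no information about `x0`. A denoising network trained to predict the returned `noise` would then be used in the reverse process; that part is not shown here.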