Leveraging Ensemble-Based Semi-Supervised Learning for Illicit Account Detection in Ethereum DeFi Transactions

Shabnam Fazliani, Mohammad Mowlavi Sorond, Arsalan Masoudifard, Shaghayegh Fazliani

TL;DR

This work targets illicit account detection in Ethereum DeFi transactions by proposing SLEID, an ensemble-based semi-supervised framework that combines Isolation Forest outlier detection with a self-training loop to generate pseudo-labels for unlabeled accounts. After expanding a DeFi-focused seed dataset into a large, feature-rich transaction graph, SLEID trains a soft-voting ensemble of Random Forest and XGBoost. On a dataset of 6,903,860 transactions it improves precision, F1, and overall accuracy over supervised and semi-supervised baselines while keeping recall comparable, and it substantially reduces labeled-data requirements. The approach is complemented by explainability analyses (SHAP and LIME) and a discussion of iterative self-learning dynamics, which shows that performance gains peak around the third iteration and raises practical considerations for deployment and threshold calibration. Overall, SLEID offers a scalable, data-efficient way to strengthen DeFi security by identifying illicit Ethereum accounts, and it points toward real-world, cross-chain fraud detection.

Abstract

The advent of smart contracts has enabled the rapid rise of Decentralized Finance (DeFi) on the Ethereum blockchain, offering substantial rewards in financial innovation and inclusivity. This growth, however, is accompanied by significant security risks such as illicit accounts engaged in fraud. Effective detection is further limited by the scarcity of labeled data and the evolving tactics of malicious accounts. To address these challenges with a robust solution for safeguarding the DeFi ecosystem, we propose $\textbf{SLEID}$, a $\textbf{S}$elf-$\textbf{L}$earning $\textbf{E}$nsemble-based $\textbf{I}$llicit account $\textbf{D}$etection framework. SLEID uses an Isolation Forest model for initial outlier detection and a self-training mechanism to iteratively generate pseudo-labels for unlabeled accounts, enhancing detection accuracy. Experiments on 6,903,860 Ethereum transactions with extensive DeFi interaction coverage demonstrate that SLEID significantly outperforms supervised and semi-supervised baselines, with a $\textbf{+2.56}$ percentage-point gain in precision, comparable recall, and a $\textbf{+0.90}$ percentage-point gain in F1 -- particularly for the minority illicit class -- alongside $\textbf{+3.74}$ percentage points higher accuracy and improvements in PR-AUC, while substantially reducing reliance on labeled data.
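
The self-training idea in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names (`soft_vote`, `select_pseudo_labels`, `self_train`, `fit_toy_models`), the 0.9 confidence threshold, the one-dimensional toy feature, and the two logistic scorers standing in for Random Forest and XGBoost are all assumptions made for illustration, and the Isolation Forest stage is left out.

```python
import math

def soft_vote(models, x):
    """Soft voting: average the illicit-class probability over all models."""
    return sum(m(x) for m in models) / len(models)

def select_pseudo_labels(models, unlabeled, threshold):
    """Split unlabeled samples into confident pseudo-labels and the rest."""
    confident, remaining = [], []
    for x in unlabeled:
        p = soft_vote(models, x)
        if p >= threshold:
            confident.append((x, 1))      # confidently illicit
        elif p <= 1.0 - threshold:
            confident.append((x, 0))      # confidently licit
        else:
            remaining.append(x)           # too uncertain, keep unlabeled
    return confident, remaining

def self_train(fit, labeled, unlabeled, rounds=3, threshold=0.9):
    """Retrain, pseudo-label confident samples, and repeat (assumed loop)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        models = fit(labeled)
        confident, unlabeled = select_pseudo_labels(models, unlabeled, threshold)
        if not confident:                 # nothing new to learn from
            break
        labeled.extend(confident)
    return fit(labeled), labeled, unlabeled

def fit_toy_models(labeled):
    """Hypothetical stand-in for training two classifiers: logistic scorers
    centred on the midpoint of the class means, with different steepness."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def make(scale):
        return lambda x: 1.0 / (1.0 + math.exp(-scale * (x - midpoint)))

    return [make(1.0), make(2.0)]

# Toy data: label 1 = illicit, label 0 = licit; features are single floats.
seed = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
pool = [0.5, 5.0, 9.5, 4.9]
models, labeled, remaining = self_train(fit_toy_models, seed, pool)
```

In this toy run the samples far from the decision boundary are absorbed into the labeled set, while the ambiguous ones near the midpoint stay unlabeled. The default of three rounds only echoes the observation that gains peak around the third iteration; a real pipeline would tune both the round count and the threshold on validation data.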

Paper Structure

This paper contains 40 sections, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Methodology overview for illicit account detection in the DeFi ecosystem. The pipeline consists of three main components: dataset preparation, model training, and prediction. Illicit accounts are initially sourced from Etherscan and DeFi, followed by dataset expansion and refinement through network analysis. After vectorization and feature extraction, recursive feature elimination is applied to optimize the features. An isolation forest detects outliers, which are then used to update the dataset. The updated dataset is fed into a voting-based ensemble model combining XGBoost and Random Forest classifiers. The trained model is evaluated on a batch of test accounts to produce the final predictions on illicit activity.
  • Figure 2: Network visualization of account interconnections. Blue, red, and gray nodes represent legitimate, illicit, and unknown accounts, respectively.
  • Figure 3: Performance comparison of illicit and licit class detection across different models.
  • Figure 4: Bipartite graph representation of our dataset. Sets A and B represent the accounts and the transactions, respectively. The directed arrows show the sender(s) and receiver(s) of each transaction.
  • Figure 5: Comparison of Isolation Forest contamination settings (0.25%, 0.5%, 1%) on downstream performance. Panels report precision, recall, F1, and accuracy; 0.5% provides the most consistent balance across metrics.
  • ...and 7 more figures
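
To make the contamination settings in Figure 5 concrete, the sketch below shows how a contamination rate translates into a number of flagged accounts. It is a simplification rather than the paper's code: `flag_outliers` is a hypothetical helper, the scores are synthetic, and the convention that lower scores are more anomalous is borrowed from scikit-learn's Isolation Forest.

```python
def flag_outliers(scores, contamination):
    """Return indices of the lowest-scoring fraction of samples.

    Assumption: lower score = more anomalous, and the contamination rate
    fixes the share of samples that end up flagged as outliers.
    """
    n_flag = round(len(scores) * contamination)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[:n_flag])

# Synthetic anomaly scores for 2,000 hypothetical accounts.
scores = [i / 2000 for i in range(2000)]

# The three rates compared in Figure 5: 0.25%, 0.5%, and 1%.
counts = {c: len(flag_outliers(scores, c)) for c in (0.0025, 0.005, 0.01)}
```

Doubling the rate doubles the number of accounts treated as outliers, which is why the setting can shift downstream precision and recall.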