Leveraging Ensemble-Based Semi-Supervised Learning for Illicit Account Detection in Ethereum DeFi Transactions
Shabnam Fazliani, Mohammad Mowlavi Sorond, Arsalan Masoudifard, Shaghayegh Fazliani
TL;DR
This work targets illicit account detection in Ethereum DeFi transactions by proposing SLEID, an ensemble-based semi-supervised framework that combines Isolation Forest outlier detection with a self-training loop to generate high-quality pseudo-labels for unlabeled accounts. By expanding a DeFi-focused seed dataset into a large, feature-rich graph and training a soft-voting ensemble of Random Forest and XGBoost, SLEID achieves higher illicit-class precision and overall accuracy than supervised and semi-supervised baselines, with comparable recall, on a multi-million-transaction dataset, while substantially reducing labeled-data requirements. The approach is complemented by explainability analyses (SHAP and LIME) and a discussion of iterative self-learning dynamics, which shows that performance gains peak around the third iteration and highlights practical considerations for deployment and threshold calibration. Overall, SLEID offers a scalable, data-efficient way to strengthen DeFi security by identifying illicit Ethereum accounts, and it points toward real-world, cross-chain fraud detection in the evolving blockchain landscape.
Abstract
The advent of smart contracts has enabled the rapid rise of Decentralized Finance (DeFi) on the Ethereum blockchain, offering substantial rewards in financial innovation and inclusivity. This growth, however, is accompanied by significant security risks such as illicit accounts engaged in fraud. Effective detection is further limited by the scarcity of labeled data and the evolving tactics of malicious accounts. To address these challenges and help safeguard the DeFi ecosystem, we propose $\textbf{SLEID}$, a $\textbf{S}$elf-$\textbf{L}$earning $\textbf{E}$nsemble-based $\textbf{I}$llicit account $\textbf{D}$etection framework. SLEID uses an Isolation Forest model for initial outlier detection and a self-training mechanism that iteratively generates pseudo-labels for unlabeled accounts, enhancing detection accuracy. Experiments on 6,903,860 Ethereum transactions with extensive DeFi interaction coverage demonstrate that SLEID significantly outperforms supervised and semi-supervised baselines, particularly on the minority illicit class: precision improves by $\textbf{+2.56}$ percentage points and F1 by $\textbf{+0.90}$ percentage points, with comparable recall. Accuracy is $\textbf{+3.74}$ percentage points higher and PR-AUC also improves, while reliance on labeled data is substantially reduced.
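The pipeline described above (outlier detection, a soft-voting ensemble, and iterative pseudo-labeling) can be illustrated with a minimal sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: the data are synthetic, the Isolation Forest anomaly score is simply appended as an extra feature, scikit-learn's GradientBoostingClassifier stands in for XGBoost so that only scikit-learn is required, and the function names, the 0.9 confidence threshold, and the default of three iterations are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    IsolationForest,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split


def make_ensemble(seed=0):
    """Soft-voting ensemble: averages the class probabilities of both models."""
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
            # Assumption: gradient boosting from scikit-learn stands in for XGBoost.
            ("gb", GradientBoostingClassifier(random_state=seed)),
        ],
        voting="soft",
    )


def self_train(X_lab, y_lab, X_unlab, n_iter=3, threshold=0.9, seed=0):
    """Hypothetical self-training loop; n_iter and threshold are assumed values."""
    # Outlier detection: fit Isolation Forest on all accounts, no labels needed.
    iso = IsolationForest(random_state=seed).fit(np.vstack([X_lab, X_unlab]))

    def with_score(X):
        # Assumption: the anomaly score is appended as one extra feature.
        return np.column_stack([X, iso.score_samples(X)])

    X_l, y_l, X_u = with_score(X_lab), y_lab.copy(), with_score(X_unlab)
    model = make_ensemble(seed).fit(X_l, y_l)

    for _ in range(n_iter):
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is confident enough to pseudo-label
        pseudo = model.classes_[proba.argmax(axis=1)]
        # Move confidently predicted accounts into the labeled pool.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo[confident]])
        X_u = X_u[~confident]
        model = make_ensemble(seed).fit(X_l, y_l)

    return model, iso, len(y_l)


# Synthetic, imbalanced stand-in for account features (1 = illicit).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
# Keep 100 labeled accounts; treat the rest as unlabeled.
X_lab, X_unlab, y_lab, _ = train_test_split(
    X, y, train_size=100, stratify=y, random_state=0
)
model, iso, n_labeled = self_train(X_lab, y_lab, X_unlab)
print(f"labeled pool grew from {len(y_lab)} to {n_labeled} accounts")
```

In this sketch the confidence threshold controls the trade-off between how quickly the labeled pool grows and how much label noise is admitted, which is one reason threshold calibration matters in practice.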
