Malicious Code Detection in Smart Contracts via Opcode Vectorization

Huanhuan Zou, Zongwei Li, Xiaoqi Li

TL;DR

This work addresses malicious code detection in Ethereum smart contracts through opcode-based vectorization: opcodes are classified semantically, then vectorized with 2-grams (N=2) and TF-IDF and fed to ML classifiers. It also explores a classifier-chain multi-label framework to capture correlations among vulnerabilities. Experiments on 500 Etherscan contracts annotated with Slither show limited performance, attributed to the small, imbalanced dataset, though the 2-gram TF-IDF approach coupled with a Decision Tree yields modest gains (accuracy around 0.667, F1 around 0.765). The study identifies data quality and class imbalance as the key bottlenecks and suggests future work on larger datasets, cross-version analysis, and semi-supervised learning to improve the practical applicability of opcode-based smart contract security tools.
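The vectorization pipeline summarized above (simplify opcodes, take 2-grams, weight with TF-IDF) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the simplification rule (collapsing the numbered `PUSH`, `DUP`, `SWAP`, and `LOG` families), the unsmoothed TF-IDF formula, and the two toy opcode sequences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical simplification rule (an assumption, not the paper's scheme):
# collapse numbered opcode families such as PUSH1..PUSH32 into one token.
_FAMILY = re.compile(r"^(PUSH|DUP|SWAP|LOG)\d+$")


def simplify(opcodes):
    """Map each opcode to its family name; leave other opcodes unchanged."""
    out = []
    for op in opcodes:
        match = _FAMILY.match(op)
        out.append(match.group(1) if match else op)
    return out


def bigrams(tokens):
    """Return adjacent token pairs (N-gram with N=2), joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def tfidf(docs):
    """Plain TF-IDF: tf = count / document length, idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            [counts[t] / total * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Two made-up opcode sequences standing in for disassembled contracts.
contracts = [
    ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO"],
    ["PUSH1", "PUSH2", "SSTORE", "CALLER", "SWAP1", "SELFDESTRUCT"],
]
docs = [bigrams(simplify(c)) for c in contracts]
vocab, vectors = tfidf(docs)
```

In this toy corpus the bigram `PUSH PUSH` occurs in both sequences, so its weight is zero, while bigrams unique to one sequence carry all the weight. Library implementations often smooth the IDF term, so their values would differ from this unsmoothed variant.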

Abstract

With the rapid development of blockchain technology, smart contracts have in recent years been widely adopted in finance, supply chains, the Internet of Things, and other fields. However, their security problems have become increasingly prominent: security incidents caused by smart contracts occur frequently, and malicious code can lead to the loss of user assets and to system crashes. This paper presents a preliminary study of machine-learning-based malicious code detection in smart contracts. The main work and results are as follows. Feature extraction and vectorization are the first step in detecting malicious code with machine learning, and feature processing has an important impact on the detection results. This paper adopts an opcode vectorization method based on smart contract text. Taking the structural characteristics of contract opcodes into account, the opcodes are first classified and simplified. The N-Gram (N=2) and TF-IDF algorithms then convert the simplified opcodes into vectors, which are used to train the machine learning model. For comparison, the N-Gram and TF-IDF algorithms are also applied directly to the unsimplified opcodes, and the resulting vectors are used for training in the same way. The two feature extraction methods are then compared on the basis of the training results. Finally, a classifier chain is applied to smart contract malicious code detection.
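The classifier-chain idea mentioned at the end of the abstract can be sketched as follows: one binary classifier is trained per label, and each classifier in the chain also receives the earlier labels as extra input features, which is how label correlations can be exploited. This is a minimal sketch under stated assumptions, not the paper's setup: the 1-nearest-neighbour base learner is a simple stand-in, and the feature vectors, the two labels, and the chain order are hypothetical.

```python
class NearestNeighbor:
    """Tiny 1-nearest-neighbour binary classifier (illustrative stand-in)."""

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict_one(self, x):
        # Squared Euclidean distance to every stored training row.
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]


class ClassifierChain:
    """Chain of binary classifiers; `order` is a permutation of label indices."""

    def __init__(self, make_base, order):
        self.make_base = make_base
        self.order = order

    def fit(self, X, Y):
        self.models = []
        for pos, label in enumerate(self.order):
            prev = self.order[:pos]
            # Training uses the TRUE values of earlier labels as features.
            X_aug = [list(x) + [y[p] for p in prev] for x, y in zip(X, Y)]
            target = [y[label] for y in Y]
            self.models.append(self.make_base().fit(X_aug, target))
        return self

    def predict_one(self, x):
        feats = list(x)
        pred = {}
        for label, model in zip(self.order, self.models):
            pred[label] = model.predict_one(feats)
            # Prediction feeds the PREDICTED earlier labels down the chain.
            feats.append(pred[label])
        return [pred[j] for j in range(len(self.order))]


# Hypothetical data: 2-dimensional features, two vulnerability labels.
X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
Y = [[0, 0], [0, 0], [1, 1], [1, 0]]

chain = ClassifierChain(NearestNeighbor, order=[0, 1]).fit(X, Y)
prediction = chain.predict_one([0.9, 1.0])
```

The chain order matters, because a mistake on an early label propagates to the later classifiers. Swapping in a different base learner only requires an object that offers the same `fit` and `predict_one` methods.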

Paper Structure

This paper contains 33 sections, 7 equations, 6 figures, 19 tables.

Figures (6)

  • Figure 1: Bytecode of Smart Contracts
  • Figure 2: Opcodes of Smart Contracts
  • Figure 3: Flowchart of bigram extraction from simplified opcodes
  • Figure 4: Vulnerability detection in smart contracts using Slither
  • Figure 5: Comparison of model training results between two methods
  • ...and 1 more figure