PrAIoritize: Automated Early Prediction and Prioritization of Vulnerabilities in Smart Contracts

Majd Soud, Grischa Liebel, Mohammad Hamdaqa

TL;DR

PrAIoritize tackles the problem of unreliable and time-consuming manual triage during smart contract code reviews by introducing a three-phase pipeline that combines lexicon-based automatic labeling of code weaknesses with a DistilBERT classifier. The approach draws on CVE/NVD data and GitHub reviews to build a domain-specific lexicon, which enables automated labeling, followed by feature engineering and transformer-based classification into four priority levels. Empirical results show that PrAIoritize outperforms state-of-the-art baselines and several pretrained models, achieving high F1-scores, particularly for critical weaknesses, and they reveal insights into the prevalence of zero-day attacks in Ethereum smart contracts. The work demonstrates the practical potential of NLP- and LLM-enabled triage to accelerate secure smart contract development and auditing, and it outlines avenues for broader data sources and model enhancements.

Abstract

Context: Smart contracts are prone to numerous security threats due to undisclosed vulnerabilities and code weaknesses. In Ethereum smart contracts, the challenge of addressing these code weaknesses in a timely manner highlights the critical need for automated early prediction and prioritization during the code review process. Objective: Toward this end, our research aims to provide an automated approach, PrAIoritize, for prioritizing and predicting critical code weaknesses in Ethereum smart contracts during the code review process. Method: We collected smart contract code reviews from Open Source Software (OSS) projects on GitHub and from the Common Vulnerabilities and Exposures (CVE) database. We then developed PrAIoritize, an automated prioritization approach that integrates Large Language Models (LLMs) with natural language processing (NLP) techniques. PrAIoritize automates code review labeling by employing a domain-specific lexicon of smart contract weaknesses and their impacts. Following this, feature engineering is conducted on the code reviews, and a pre-trained DistilBERT model is used for priority classification. Finally, the model is trained and evaluated using code reviews of smart contracts. Results: Our evaluation demonstrates significant improvement over state-of-the-art baselines and commonly used pre-trained models (e.g., T5) for similar classification tasks, with a 4.82%-27.94% increase in F-measure, precision, and recall. Conclusion: By leveraging PrAIoritize, practitioners can efficiently prioritize smart contract code weaknesses, addressing critical code weaknesses promptly and reducing the time and effort required for manual triage.
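
To make the lexicon-based labeling step described in the abstract more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The lexicon entries, the priority names ("critical", "high", "medium", "low"), and the function name `label_review` are all hypothetical; the abstract only states that a domain-specific lexicon is used and that there are four priority levels.

```python
import re

# Hypothetical lexicon: the terms and their priority levels are illustrative
# assumptions, not the lexicon built in the paper.
LEXICON = {
    "reentrancy": "critical",
    "loss of funds": "critical",
    "integer overflow": "high",
    "access control": "high",
    "denial of service": "medium",
    "gas limit": "medium",
}

# Assumed priority names, ordered from most to least severe.
PRIORITY_ORDER = ["critical", "high", "medium", "low"]


def label_review(text: str) -> str:
    """Return the most severe priority whose lexicon term occurs in the text."""
    # Lowercase and collapse whitespace so multi-word terms match reliably.
    normalized = re.sub(r"\s+", " ", text.lower())
    matched = {level for term, level in LEXICON.items() if term in normalized}
    for level in PRIORITY_ORDER:
        if level in matched:
            return level
    # No lexicon term matched: fall back to the lowest priority.
    return "low"
```

In a pipeline like the one the abstract outlines, labels produced this way would serve as training targets for the downstream classifier (DistilBERT in the paper). Taking the most severe match is one possible tie-breaking rule and is an assumption of this sketch.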

Paper Structure (32 sections, 9 equations, 4 figures, 7 tables, 1 algorithm)

This paper contains 32 sections, 9 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustrated Examples: Real-World Smart Contract Code Reviews
  • Figure 2: Timeline of Zero-day Attacks [bilge2012before].
  • Figure 3: PrAIoritize Approach Overview
  • Figure 4: Confusion matrix for the classification results of the PrAIoritize model.