An Empirical Analysis of Vulnerability Detection Tools for Solidity Smart Contracts Using Line Level Manually Annotated Vulnerabilities
Francesco Salzano, Cosmo Kevin Antenucci, Simone Scalabrino, Giovanni Rosa, Rocco Oliveto, Remo Pareschi
TL;DR
The paper addresses the security of Solidity smart contracts through a large-scale empirical evaluation of the vulnerability detection tools available in SmartBugs 2.0, using a manually annotated line-level dataset of 2,182 instances mapped to the DASP TOP 10 taxonomy. It also assesses an LLM-based detector (ChatGPT-4o) on two diverse datasets and shows that no single tool covers all vulnerability classes. The authors therefore advocate a cluster-based combination of Conkas, Slither, and Smartcheck that detects up to 76.78% of vulnerabilities with an average runtime of under one minute. They release the largest line-level annotated dataset to date and provide detailed analyses of tool strengths and failure modes, highlighting significant gaps when the tools are applied to real-world contracts. Their findings underscore the importance of multi-tool ensembles and of accounting for dataset complexity in vulnerability detection, with practical implications for secure smart contract development and auditing.
Abstract
The rapid adoption of blockchain technology has highlighted the importance of securing smart contracts, given their critical role in executing automated business logic on blockchain platforms. This paper provides an empirical evaluation of automated vulnerability analysis tools designed for Solidity smart contracts. Leveraging the SmartBugs 2.0 framework, which includes 20 analysis tools, we conducted a comprehensive assessment on a dataset of 2,182 instances that we manually annotated with line-level vulnerability labels. Our evaluation measures how effectively these tools detect the various types of vulnerabilities categorized by the DASP TOP 10 taxonomy. We also evaluated a Large Language Model-based detection method on two popular datasets. Its results were inconsistent across the two datasets, indicating unreliable detection on real-world smart contracts. Our study identifies significant variations in the accuracy and reliability of different tools and demonstrates the advantages of combining multiple detection methods to improve vulnerability identification. We identified a set of three tools that, combined, detect up to 76.78% of the vulnerabilities while taking less than one minute to run, on average. This study contributes to the field by releasing the largest dataset of manually analyzed smart contracts with line-level vulnerability annotations and by providing the empirical evaluation of the greatest number of tools to date.
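To make the idea of a combined detection rate concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that a finding counts as a detection only when contract, line, and category all match an annotation exactly, and it picks a tool combination by exhaustive search; both are illustrative assumptions, and the paper's own matching criterion and cluster-based selection may differ. All tool names and data below are invented for illustration.

```python
from itertools import combinations

# Toy, invented data for illustration only; not taken from the paper's dataset.
# Each annotated vulnerability is a (contract, line, category) triple.
ground_truth = {
    ("a.sol", 10, "reentrancy"),
    ("a.sol", 42, "arithmetic"),
    ("b.sol", 7, "access_control"),
    ("b.sol", 19, "unchecked_low_level_calls"),
}

# Hypothetical findings per tool, in the same triple format.
findings = {
    "tool_a": {("a.sol", 10, "reentrancy"), ("b.sol", 99, "arithmetic")},
    "tool_b": {("a.sol", 42, "arithmetic")},
    "tool_c": {("b.sol", 7, "access_control"), ("a.sol", 10, "reentrancy")},
    "tool_d": {("a.sol", 42, "arithmetic"), ("a.sol", 10, "reentrancy")},
}


def detection_rate(tools, findings, ground_truth):
    """Fraction of annotated vulnerabilities found by at least one tool."""
    detected = set()
    for tool in tools:
        # Keep only findings that match an annotation (assumed exact match).
        detected |= findings[tool] & ground_truth
    return len(detected) / len(ground_truth)


def best_combination(findings, ground_truth, size):
    """Exhaustively search all tool subsets of the given size.

    Ties are resolved in favor of the first combination in sorted order.
    """
    return max(
        combinations(sorted(findings), size),
        key=lambda combo: detection_rate(combo, findings, ground_truth),
    )


best = best_combination(findings, ground_truth, size=3)
rate = detection_rate(best, findings, ground_truth)
print(best, rate)
```

Because the rate is computed over the union of matched findings, adding a tool can never lower it; the practical trade-off is the added runtime, which is why a small, complementary set of tools is attractive.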
