SC-Bench: A Large-Scale Dataset for Smart Contract Auditing
Shihao Xia, Mengting He, Linhai Song, Yiying Zhang
TL;DR
SC-Bench introduces the first large-scale dataset for automated smart-contract auditing, combining 5,377 real Ethereum contracts with 15,975 ERC-rule violations (139 real, 15,836 injected) to benchmark ML methods. Prompted with the contracts and the full ERC rules, GPT-4 detects only 0.9% of the violations; when it is also given the violated rule and the corresponding code site (the oracle) and asked a true-or-false question, it detects 22.9%, indicating substantial room for improvement in ML-based auditing. The dataset integrates real violations and systematically injected errors across ERC20, ERC721, and ERC1155 rules, and releases accompanying code and injection scripts to foster research beyond smart contracts, including API usage rule checks. Overall, SC-Bench demonstrates both the potential of ML-augmented auditing and the current gaps, underscoring the need for broader ERC coverage and more sophisticated prompting or models to advance automated smart-contract safety and compliance.
Abstract
There is a huge demand to ensure that smart contracts listed on blockchain platforms comply with safety and economic standards. Today, manual effort in the form of auditing is commonly used to achieve this goal. ML-based automated techniques promise to reduce this human effort and the resulting monetary costs. However, unlike other domains where ML techniques have had huge successes, no systematic ML techniques have been proposed or applied to smart contract auditing. We present SC-Bench, the first dataset for automated smart-contract auditing research. SC-Bench consists of 5,377 real-world smart contracts running on Ethereum, a widely used blockchain platform, and 15,975 violations of Ethereum standards called ERCs. Of these violations, 139 are real violations made by programmers. The remaining violations are errors we systematically injected to reflect violations of different ERC rules. We evaluate SC-Bench using GPT-4 by prompting it with both the contracts and the ERC rules. In addition, we manually identify each violated rule and the corresponding code site (i.e., the oracle) and prompt GPT-4 with this information as a true-or-false question. Our results show that without the oracle, GPT-4 detects only 0.9% of the violations, and with the oracle, it detects 22.9% of the violations. These results show the room for improvement in ML-based techniques for smart-contract auditing.
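The two evaluation settings described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the prompt wording, the `Violation` record, and all function names are assumptions made for this example, and no model is actually called. The only figures taken from the text are the violation counts.

```python
from dataclasses import dataclass
from typing import List

# Counts stated in the text: real violations, injected violations, and their total.
REAL_VIOLATIONS = 139
INJECTED_VIOLATIONS = 15_836
TOTAL_VIOLATIONS = REAL_VIOLATIONS + INJECTED_VIOLATIONS  # 15,975


@dataclass
class Violation:
    """Hypothetical record for one labeled violation (the 'oracle' information)."""
    rule: str       # text of the violated ERC rule
    code_site: str  # function or snippet where the rule is violated


def build_full_rule_prompt(contract_source: str, erc_rules: List[str]) -> str:
    """Setting without the oracle: the model sees the contract and the ERC rules."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(erc_rules, start=1))
    return (
        "ERC rules:\n"
        f"{numbered}\n\n"
        "Contract:\n"
        f"{contract_source}\n\n"
        "List every rule above that the contract violates."
    )


def build_oracle_prompt(contract_source: str, violation: Violation) -> str:
    """Setting with the oracle: one rule and one code site, posed as True/False."""
    return (
        "Contract:\n"
        f"{contract_source}\n\n"
        f"Rule: {violation.rule}\n"
        f"Code site: {violation.code_site}\n\n"
        "Answer True or False: the code site violates the rule."
    )


def detection_rate(detected: int, total: int) -> float:
    """Percentage of labeled violations that were detected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * detected / total
```

The split into two builders mirrors the distinction drawn in the text: the first prompt asks the model to find violations among all rules, while the second narrows the task to a single yes-or-no judgment about a known rule and code site.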
