Evaluation of ChatGPT's Smart Contract Auditing Capabilities Based on Chain of Thought
Yuying Du, Xueyan Tang
TL;DR
This study investigates GPT-4's utility in smart contract auditing by evaluating vulnerability detection, code parsing, and PoC writing using Chain-of-Thought (CoT) prompts. On a SolidiFI-benchmark dataset of 35 contracts with 732 injected vulnerabilities, GPT-4 achieves high precision (96.6%) but low recall (37.8%), indicating that it misses many vulnerabilities, while still showing strong code-parsing and PoC-writing potential. The methodology combines CoT-based prompts, eight audit reports for the code-analysis evaluation, and ten contracts for PoC testing. Overall, GPT-4 functions effectively as an auxiliary auditing aid rather than a replacement for specialized vulnerability-detection tools and professional auditors, with potential to streamline PoC generation and contract-understanding tasks.
Abstract
Smart contracts, a key component of blockchain technology, play a crucial role in automating transactions and ensuring adherence to protocol rules. However, they are susceptible to security vulnerabilities which, if exploited, can lead to significant asset losses. This study explores the potential of the GPT-4 model to enhance smart contract security audits. We used a dataset of 35 smart contracts from the SolidiFI-benchmark vulnerability library, containing 732 vulnerabilities, and compared GPT-4 with five other vulnerability detection tools to evaluate its ability to identify seven common types of vulnerabilities. We also assessed GPT-4's performance in code parsing and vulnerability capture by simulating a professional auditor's process using CoT (Chain of Thought) prompts based on the audit reports of eight groups of smart contracts, and we evaluated its ability to write Solidity Proofs of Concept (PoCs). In our experiments, GPT-4 performed poorly in detecting smart contract vulnerabilities: its Precision was high at 96.6%, but its Recall was only 37.8% and its F1-score 41.1%, indicating a tendency to miss vulnerabilities during detection. Meanwhile, it demonstrated good contract code parsing capabilities, with an average comprehensive score of 6.5, and was able to identify the background information and functional relationships of smart contracts. In 60% of the cases it could write usable PoCs, suggesting that GPT-4 has significant potential in PoC writing. These experimental results indicate that GPT-4 lacks the ability to detect smart contract vulnerabilities effectively, but its performance in contract code parsing and PoC writing demonstrates significant potential as an auxiliary tool for improving the efficiency and effectiveness of smart contract security audits.
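To make the detection metrics concrete, the following Python sketch shows how Precision, Recall, and F1-score relate. The harmonic mean of the aggregate Precision (96.6%) and Recall (37.8%) is roughly 54%, which is higher than the reported F1-score of 41.1%. One possible explanation, which is an assumption on our part since the abstract does not state how the figures were aggregated, is that the reported values are averaged over the seven vulnerability types rather than computed from pooled counts. The per-type counts and type names below are purely hypothetical and serve only to show how the two averaging schemes can diverge.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r, f1_score(p, r)


# F1 implied by the aggregate figures quoted in the abstract.
aggregate_f1 = f1_score(0.966, 0.378)

# HYPOTHETICAL per-type counts (tp, fp, fn); not taken from the paper.
hypothetical_counts = {
    "type_a": (90, 2, 10),
    "type_b": (10, 1, 90),
    "type_c": (5, 0, 95),
}

# Macro average: compute F1 per type, then take the unweighted mean.
macro_f1 = sum(prf(*c)[2] for c in hypothetical_counts.values()) / len(hypothetical_counts)

# Micro average: pool the counts across types, then compute F1 once.
pooled = [sum(col) for col in zip(*hypothetical_counts.values())]
micro_f1 = prf(*pooled)[2]

print(f"aggregate F1 from 96.6% / 37.8%: {aggregate_f1:.3f}")
print(f"macro F1 (hypothetical counts):  {macro_f1:.3f}")
print(f"micro F1 (hypothetical counts):  {micro_f1:.3f}")
```

With these hypothetical counts, the macro-averaged F1 falls below the micro-averaged F1 because types with very low recall pull the unweighted mean down, even when overall precision stays high.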
