When ChatGPT Meets Smart Contract Vulnerability Detection: How Far Are We?

Chong Chen, Jianzhong Su, Jiachi Chen, Yanlin Wang, Tingting Bi, Jianxing Yu, Yanli Wang, Xingwei Lin, Ting Chen, Zibin Zheng

TL;DR

The paper empirically evaluates ChatGPT for smart contract vulnerability detection using the Smartbugs-curated dataset and a prompt-optimized, two-stage workflow. Across nine DASP10 vulnerability categories, ChatGPT achieves high recall but low precision, with performance varying by vulnerability type and a substantial risk of false positives. Benchmarked against 14 state-of-the-art tools, ChatGPT offers broad vulnerability coverage and excellent efficiency, but its F-score is lower than the other tools' for 3 of the 7 vulnerabilities compared, due to token-length and contextual limits. The study also analyzes limitations related to answer uncertainty and context length, and examines external knowledge integration, multi-round conversations, code poisoning, and obfuscation attacks to assess robustness. Overall, the work identifies the strengths and weaknesses of LLM-based vulnerability detection and outlines directions for improving its reliability and scalability in blockchain security.

Abstract

With the development of blockchain technology, smart contracts have become an important component of blockchain applications. Despite their crucial role, the development of smart contracts may introduce vulnerabilities that can lead to severe consequences, such as financial losses. Meanwhile, large language models, represented by ChatGPT, have attracted great attention and shown strong capabilities in code analysis tasks. In this paper, we present an empirical study of ChatGPT's performance in identifying smart contract vulnerabilities. First, we evaluate ChatGPT's effectiveness on a publicly available smart contract dataset. We find that while ChatGPT achieves a high recall rate, its precision in pinpointing smart contract vulnerabilities is limited. Furthermore, ChatGPT's performance varies across vulnerability types. We investigate the root causes of the false positives generated by ChatGPT and categorize them into four groups. Second, by comparing ChatGPT with other state-of-the-art smart contract vulnerability detection tools, we find that ChatGPT's F-score is lower than the others' for 3 of the 7 vulnerabilities; for the remaining 4, ChatGPT has a slight advantage over these tools. Finally, we analyze the limitations of ChatGPT in smart contract vulnerability detection, revealing that its robustness in this field needs to improve in two respects: its uncertainty in answering questions, and the limited length of code it can analyze. Overall, our research provides insights into the strengths and weaknesses of employing large language models, specifically ChatGPT, for the detection of smart contract vulnerabilities.
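
The abstract contrasts recall, precision, and F-score. As a minimal sketch of how these three metrics relate, the Python snippet below computes them from detection counts. The `Confusion` class and the counts in `example` are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Detection counts for one vulnerability type (hypothetical helper)."""
    tp: int  # vulnerable contracts correctly flagged
    fp: int  # non-vulnerable contracts wrongly flagged
    fn: int  # vulnerable contracts that were missed


def precision(c: Confusion) -> float:
    """Fraction of flagged contracts that are actually vulnerable."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    """Fraction of vulnerable contracts that were flagged."""
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f1(c: Confusion) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only; these are NOT figures reported in the paper.
example = Confusion(tp=18, fp=42, fn=2)
print(
    f"precision={precision(example):.2f} "
    f"recall={recall(example):.2f} "
    f"f1={f1(example):.2f}"
)
# prints: precision=0.30 recall=0.90 f1=0.45
```

With these made-up counts, a detector that flags almost every vulnerable contract (recall 0.90) still ends up with a modest F-score because most of its alerts are false positives (precision 0.30).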

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The overview of our research.
  • Figure 2: An example of prompt optimization using the GPT-4 model.
  • Figure 3: The uncertainty test results of ChatGPT. Each part shows the number of contracts at a given level of consistency; for example, "5 consistent" is the number of contracts whose results agreed across all 5 tests.
  • Figure 4: GPT-4's original detection results of smart contracts containing Price Oracle Manipulation vulnerabilities.
  • Figure 5: GPT-4's improved detection results of smart contracts containing Price Oracle Manipulation vulnerabilities.
  • ...and 2 more figures
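
The consistency counts described in the Figure 3 caption can be sketched roughly as follows. This assumes that "k consistent" means the largest group of identical answers among the repeated tests has size k; the paper's exact definition may differ. The function names, contract names, and answers are hypothetical.

```python
from collections import Counter


def consistency_level(answers: list[str]) -> int:
    """Size of the largest group of identical answers across repeated tests."""
    if not answers:
        raise ValueError("at least one answer is required")
    return Counter(answers).most_common(1)[0][1]


def consistency_histogram(runs_by_contract: dict[str, list[str]]) -> dict[int, int]:
    """Map each consistency level to the number of contracts at that level."""
    levels = Counter(consistency_level(a) for a in runs_by_contract.values())
    return dict(sorted(levels.items(), reverse=True))


# Hypothetical answers from 5 repeated tests per contract (not data from the paper).
runs = {
    "contract_a.sol": ["yes", "yes", "yes", "yes", "yes"],
    "contract_b.sol": ["yes", "no", "yes", "yes", "no"],
    "contract_c.sol": ["no", "no", "no", "no", "no"],
}
print(consistency_histogram(runs))  # prints: {5: 2, 3: 1}
```

Here two contracts received the same answer in all 5 tests, and one received the same answer in only 3 of 5.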