A Context-Driven Approach for Co-Auditing Smart Contracts with the Support of GPT-4 Code Interpreter
Mohamed Salah Bouafif, Chen Zheng, Ilham Ahmed Qasse, Ed Zulkoski, Mohammad Hamdaqa, Foutse Khomh
TL;DR
This work tackles the challenge of reliable smart contract auditing with the GPT-4 Code Interpreter by addressing prompt design and context length. It introduces three context-augmentation strategies to guide the model: Code Call List (CCL) based code scoping, CAQ-based reporting prompts, and CWE-based assessment prompts. Empirical results on public datasets show that CCL-based chunking combined with CAQ prompting substantially improves vulnerability detection over prompting on the full code, with detection rates reaching 96% in some configurations; qualitative evaluations by expert human auditors corroborate these improvements on unlabeled data. The authors provide publicly available artifacts and datasets, highlighting practical implications for industry adoption and future research in LLM-assisted code auditing.
Abstract
The surge in the adoption of smart contracts necessitates rigorous auditing to ensure their security and reliability. Manual auditing, although comprehensive, is time-consuming and heavily reliant on the auditor's expertise. With the rise of Large Language Models (LLMs), there is growing interest in leveraging them to assist auditors in the auditing process (co-auditing). However, the effectiveness of LLMs in smart contract co-auditing is contingent upon the design of the input prompts, especially in terms of context description and code length. This paper introduces a novel context-driven prompting technique for smart contract co-auditing. Our approach employs three techniques for context scoping and augmentation: (1) code scoping, which chunks long code into self-contained segments based on code inter-dependencies; (2) assessment scoping, which enriches the context description with the target assessment goal, thereby limiting the search space; and (3) reporting scoping, which enforces a specific format for the generated response. Through empirical evaluations on publicly available vulnerable contracts, our method achieved a detection rate of 96% for vulnerable functions, outperforming the native prompting approach, which detected only 53%. To assess the reliability of our prompting approach, expert auditors from our partner Quantstamp, a world-leading smart contract auditing company, manually analyzed the results. Their analysis indicates that, on unlabeled datasets, our proposed approach enhances the proficiency of the GPT-4 Code Interpreter in detecting vulnerabilities.
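The code-scoping idea from the abstract — splitting a long contract into self-contained segments along its call dependencies so each prompt fits the model's context — can be sketched as follows. This is an illustrative, simplified reconstruction, not the authors' implementation; the function names (`transitive_calls`, `build_chunks`) and the toy call-list representation are assumptions for demonstration only.

```python
# Hypothetical sketch of CCL-based code scoping: from a per-function call
# list, build one self-contained chunk per function by pulling in every
# function it transitively calls, so each chunk can be audited in isolation.

def transitive_calls(func, call_list):
    """Return func plus every function it (transitively) calls."""
    seen, stack = set(), [func]
    while stack:
        f = stack.pop()
        if f not in seen:
            seen.add(f)
            stack.extend(call_list.get(f, []))
    return seen

def build_chunks(call_list, source):
    """Map each function to a chunk holding its body and its dependencies."""
    return {
        f: "\n\n".join(source[g] for g in sorted(transitive_calls(f, call_list)))
        for f in call_list
    }

# Toy contract: withdraw() depends on _send(); deposit() stands alone.
call_list = {"withdraw": ["_send"], "_send": [], "deposit": []}
source = {
    "withdraw": "function withdraw() public { _send(msg.sender); }",
    "_send": "function _send(address to) internal { /* transfer */ }",
    "deposit": "function deposit() public payable { }",
}
chunks = build_chunks(call_list, source)
# The chunk for withdraw() includes _send() as well, while deposit()'s
# chunk contains only its own body -- short but still self-contained.
```

In the paper's setting, each such chunk (rather than the full contract) would be submitted to the GPT-4 Code Interpreter together with the assessment- and reporting-scoping prompts.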
