Retrieval Augmented Generation Integrated Large Language Models in Smart Contract Vulnerability Detection
Jeffy Yu
TL;DR
This work investigates democratizing smart-contract security auditing by integrating Retrieval-Augmented Generation (RAG) with large language models (LLMs), specifically GPT-4-1106, to detect vulnerabilities in DeFi contracts. The authors build an 830-contract vulnerability vector store using Pinecone and OpenAI embeddings, and evaluate the RAG-LLM pipeline under guided and blind prompts in a two-phase experimental design. Phase One (guided) achieves a 62.7% success rate, while Phase Two (blind) achieves 60.71%, suggesting that detection generalizes without a hinted vulnerability type but also highlighting variability and the continued need for human review. The study argues that RAG-LLMs can lower auditing costs and broaden access, while noting limitations related to data integrity, prompt compliance, context handling, and ethical considerations that must be addressed before scalable, responsible deployment in real-world DeFi security workflows.
Abstract
The rapid growth of Decentralized Finance (DeFi) has been accompanied by substantial financial losses due to smart contract vulnerabilities, underscoring the critical need for effective security auditing. As attacks become more frequent, the demand for auditing services has escalated. This creates a particular financial burden for independent developers and small businesses, who often have limited funding for these services. Our study builds upon existing frameworks by integrating Retrieval-Augmented Generation (RAG) with large language models (LLMs), specifically employing GPT-4-1106 for its 128k-token context window. We construct a vector store of 830 known vulnerable contracts, using Pinecone for vector storage, OpenAI's text-embedding-ada-002 for embeddings, and LangChain to build the RAG-LLM pipeline. Prompts are designed to elicit a binary answer for vulnerability detection. First, we test 52 smart contracts 40 times each against a provided vulnerability type, verifying the replicability and consistency of the RAG-LLM. The results are encouraging, with a 62.7% success rate in guided detection of vulnerabilities. Second, we challenge the model under a "blind" audit setup, in which the vulnerability type is not provided in the prompt and 219 contracts undergo 40 tests each. This setup evaluates general vulnerability detection capability without hinted context. Under these conditions, we observe a 60.71% success rate. While the results are promising, we still emphasize the need for human auditing at this time. We provide this study as a proof of concept for a cost-effective smart contract auditing process, moving towards democratic access to security.
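The repeated-trial protocol described above can be illustrated with a minimal sketch. It is not the study's implementation: the names `parse_binary_answer`, `success_rate`, and `ask_model` are hypothetical, the YES/NO reply format is an assumption about how a binary answer might be phrased, and the call to the RAG-LLM pipeline is replaced by a random stub. The sketch also assumes that every tested contract is known to be vulnerable, so a YES reply counts as a success.

```python
import random
from typing import Callable, Dict

TRIALS_PER_CONTRACT = 40  # from the abstract: each contract is tested 40 times


def parse_binary_answer(response: str) -> bool:
    """Map a free-text reply to True (vulnerable) or False.

    Assumption: the prompt asks the model to begin its reply with YES or NO.
    """
    return response.strip().upper().startswith("YES")


def success_rate(
    contracts: Dict[str, str],
    ask_model: Callable[[str], str],
    trials: int = TRIALS_PER_CONTRACT,
) -> float:
    """Fraction of all trials in which a known-vulnerable contract is flagged."""
    hits = 0
    total = 0
    for source in contracts.values():
        for _ in range(trials):
            # Each trial is an independent query; repeated queries expose
            # run-to-run variability in the model's answers.
            if parse_binary_answer(ask_model(source)):
                hits += 1
            total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(0)

    def stub_model(source: str) -> str:
        # Hypothetical stand-in for the RAG-LLM call; answers YES ~60% of the time.
        return "YES" if rng.random() < 0.6 else "NO"

    # 52 placeholder contracts, mirroring the size of the guided phase.
    contracts = {f"contract_{i}": "// Solidity source here" for i in range(52)}
    rate = success_rate(contracts, stub_model)
    print(f"{len(contracts) * TRIALS_PER_CONTRACT} trials, success rate {rate:.2%}")
```

With 40 trials per contract, the guided phase comprises 52 × 40 = 2,080 trials and the blind phase 219 × 40 = 8,760 trials. Pooling all trials, as done here, is one possible aggregation; averaging per-contract rates would give the same figure only because every contract receives the same number of trials.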
