Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models

Peter Ince, Xiapu Luo, Jiangshan Yu, Joseph K. Liu, Xiaoning Du

TL;DR

The paper investigates whether fine-tuned open-source language models can outperform OpenAI's GPT-4 in smart contract vulnerability detection. The authors train two Code Llama variants (Foundation and Instruct) on a 17k-prompt dataset, fine-tune GPT-3.5 Turbo on a 4k-prompt subset, and compare the results against GPT-4 and GPT-4 Turbo on a custom test set spanning eight vulnerabilities. Across binary vulnerability detection and multi-vulnerability identification, the fine-tuned models achieve competitive or superior weighted F1 scores compared to GPT-4, with the fine-tuned GPT-3.5 Turbo often delivering the best overall performance. The work also releases open-source models, prompts, and datasets to enable future research, and discusses practical implications for scalability, Solidity-version coverage, and model-size trade-offs.

Abstract

In this paper, we test the hypothesis that although OpenAI's GPT-4 performs well generally, we can fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection. We fine-tune two models from Meta's Code Llama on a dataset of 17k prompts, Detect Llama - Foundation and Detect Llama - Instruct, and we also fine-tune OpenAI's GPT-3.5 Turbo model (GPT-3.5FT). We then evaluate these models, plus a random baseline, against GPT-4 and GPT-4 Turbo on a test set we develop, comparing weighted F1 scores for the detection of eight vulnerabilities from the dataset and of the two top identified vulnerabilities. We find that for binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of $0.776$ and $0.68$, outperforming both GPT-4 and GPT-4 Turbo ($0.66$ and $0.675$, respectively). For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperform GPT-4 and GPT-4 Turbo in weighted F1 for all vulnerabilities ($0.61$ and $0.56$, respectively, against GPT-4's $0.218$ and GPT-4 Turbo's $0.243$) and in weighted F1 for the top two identified vulnerabilities ($0.719$ for GPT-3.5FT and $0.674$ for Detect Llama - Foundation, against GPT-4's $0.363$ and GPT-4 Turbo's $0.429$).
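
Since the results above are reported as weighted F1 scores, a minimal sketch of that metric may help. It uses the standard definition (per-class F1 averaged with weights proportional to each class's number of true instances) and is not the authors' evaluation code. The label names and toy predictions are hypothetical, and the sketch assumes one label per contract for simplicity, which may differ from the paper's setup.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """F1 for one label, treating that label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average per-label F1, weighted by each label's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_label(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Hypothetical toy data: label names are illustrative, not taken from the paper.
y_true = ["reentrancy", "reentrancy", "reentrancy", "overflow", "safe", "safe"]
y_pred = ["reentrancy", "reentrancy", "overflow", "overflow", "safe", "safe"]

score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.3f}")
```

Weighting by support means frequent vulnerability classes dominate the score, so a model that does well only on rare classes gains little under this metric.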

Paper Structure (42 sections, 4 tables)