Large Language Model based Smart Contract Auditing with LLMBugScanner

Yining Yuan, Yifei Wang, Yichang Xu, Zachary Yahn, Sihao Hu, Ling Liu

TL;DR

The paper tackles unreliable vulnerability detection in smart contracts by introducing LLMBugScanner, which couples domain knowledge adaptation with ensemble reasoning to improve generalization across vulnerability types and code structures. It employs two-stage fine-tuning via LoRA, first on a broad Ethereum dataset and then on a CVE-derived instructional set, and combines multiple lightweight LLMs through weighted majority and tie-breaking voting to boost robustness and coverage. Experimental results on a CVE-Solidity benchmark show that fine-tuned models outperform baselines and that ensembles achieve the highest Top-5 hit rates, with 60% Top-5 accuracy on 108 CVE-labeled contracts and a 19% improvement over single-model baselines. The framework is presented as scalable, cost-effective, and extensible, with future directions including learning-based ensembles, hallucination mitigation, and code normalization to further enhance reliability in real-world smart contract auditing.
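To make the Top-5 metric concrete, the following is a minimal Python sketch of a top-k hit rate. It assumes that each contract has one ground-truth vulnerability label and that a "hit" means this label appears among the model's first k ranked predictions; the function name `top_k_hit_rate` and the toy labels are illustrative and are not taken from the paper.

```python
from typing import Sequence


def top_k_hit_rate(
    ranked_predictions: Sequence[Sequence[str]],
    ground_truth: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of contracts whose true label is among the first k predictions.

    Assumption: one ground-truth label per contract (hypothetical setup).
    """
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truth)
        if truth in preds[:k]
    )
    return hits / len(ground_truth)


# Toy data: three contracts, each with a ranked list of predicted labels.
toy_predictions = [
    ["Access Control", "Integer Overflow"],
    ["Wrong Logic"],
    ["Integer Overflow", "Token Devalue"],
]
toy_truth = ["Integer Overflow", "Access Control", "Integer Overflow"]

rate_at_5 = top_k_hit_rate(toy_predictions, toy_truth, k=5)
rate_at_1 = top_k_hit_rate(toy_predictions, toy_truth, k=1)
```

With this toy data, two of the three contracts count as hits at k=5 and only one at k=1, which shows why Top-5 figures are more lenient than Top-1 figures.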

Abstract

This paper presents LLMBugScanner, a large language model (LLM) based framework for smart contract vulnerability detection using fine-tuning and ensemble learning. Smart contract auditing presents several challenges for LLMs: different pretrained models exhibit varying reasoning abilities, and no single model performs consistently well across all vulnerability types or contract structures. These limitations persist even after fine-tuning individual LLMs. To address these challenges, LLMBugScanner combines domain knowledge adaptation with ensemble reasoning to improve robustness and generalization. Through domain knowledge adaptation, we fine-tune LLMs on complementary datasets to capture both general code semantics and instruction-guided vulnerability reasoning, using parameter-efficient tuning to reduce computational cost. Through ensemble reasoning, we leverage the complementary strengths of multiple LLMs and apply a consensus-based conflict resolution strategy to produce more reliable vulnerability assessments. We conduct extensive experiments across multiple popular LLMs and compare LLMBugScanner with both pretrained and fine-tuned individual models. Results show that LLMBugScanner achieves consistent accuracy improvements and stronger generalization, demonstrating that it provides a principled, cost-effective, and extensible framework for smart contract auditing.
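The consensus step can be illustrated with a small Python sketch of weighted majority voting with a tie-break. This is one plausible reading, not the authors' implementation: it assumes each model returns a ranked list of vulnerability labels, that each model has a scalar weight (for example, a validation score), and that ties are broken by the ranking of the highest-weight model. The names `weighted_majority_vote`, `model_a`, `model_b`, and `model_c` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_majority_vote(
    predictions: Dict[str, List[str]],
    weights: Dict[str, float],
    top_k: int = 5,
) -> List[str]:
    """Merge per-model label lists into one ranked list of at most top_k labels."""
    if not predictions:
        return []

    # Each model adds its weight once to every distinct label it proposes.
    scores: Dict[str, float] = defaultdict(float)
    for model, labels in predictions.items():
        for label in dict.fromkeys(labels):  # de-duplicate, keep order
            scores[label] += weights[model]

    # Assumed tie-break: prefer the order given by the highest-weight model;
    # labels that model did not propose go last, then alphabetical order.
    strongest = max(predictions, key=lambda m: weights[m])
    strongest_order = list(dict.fromkeys(predictions[strongest]))
    rank = {label: i for i, label in enumerate(strongest_order)}

    def sort_key(label: str):
        return (-scores[label], rank.get(label, len(rank)), label)

    return sorted(scores, key=sort_key)[:top_k]


toy_predictions = {
    "model_a": ["Integer Overflow", "Access Control"],
    "model_b": ["Access Control", "Wrong Logic"],
    "model_c": ["Integer Overflow", "Wrong Logic"],
}

# Unequal weights: "Access Control" collects the most weight (3 + 2 = 5).
weighted = weighted_majority_vote(
    toy_predictions, {"model_a": 3, "model_b": 2, "model_c": 1}
)

# Equal weights: every label scores 2, so the tie-break decides the order.
tied = weighted_majority_vote(
    toy_predictions, {"model_a": 1, "model_b": 1, "model_c": 1}
)
```

In the equal-weight case all three labels tie, so the final order follows `model_a`'s ranking, with the label it did not propose placed last. A learned combiner, which the summary lists as future work, would replace the fixed weights used here.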

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Confusion matrices of vulnerability classification before and after fine-tuning (FT). The left matrix shows DeepSeek’s baseline predictions, where diverse categories such as “Access Control,” “Token Creation Vulnerability,” and “Wrong Logic” were often misclassified as “Integer Overflow” or “Token Devalue.” After fine-tuning (right), the model demonstrates substantially improved discrimination, with most cases of “Integer Overflow” correctly identified and reduced confusion across categories.
  • Figure 2: Overview of LLMBugScanner, which consists of two stages, domain knowledge adaptation and model ensemble, to improve the effectiveness of smart contract vulnerability detection.
  • Figure 3: Comparative distributions of vulnerability categories across the Ethereum and CVE datasets.
  • Figure 4: Auditor prompt
  • Figure 5: Training loss curves for models fine-tuned on Ethereum only (a), CVE only (b), and both datasets (c). The plots show rapid loss reduction during the early epochs with continued convergence across all settings. Notably, when fine-tuned on both datasets, the loss for the second dataset starts lower than in the standalone training scenario, highlighting the effectiveness of pre-training on the first dataset.
  • ...and 2 more figures