Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study

Xuefeng Jiang, Lvhua Wu, Sheng Sun, Jia Li, Jingjing Xue, Yuwei Wang, Tingting Wu, Min Liu

TL;DR

This work tackles code vulnerability detection (CVD) by systematically benchmarking fine-tuned large language models (LLMs) against graph-based and medium-size sequence models across five datasets, including long-code samples. It employs LoRA-based fine-tuning of four open-source LLMs and compares them with three graph-based and two medium-size sequence models in a unified open-source framework, with careful handling of dataset balance and input length via short/long splits and 25,000-sample subsampling. The results reveal that LLMs excel on long-context data and exhibit low false-positive rates, but their performance is highly sensitive to class balance, whereas medium-size models perform better on short samples; longer sequences benefit from more data, but imbalance remains a dominant factor. The study provides an open, reproducible benchmark and actionable insights for deploying CVD systems, while outlining future work on data quality, larger models, and alternative parameter-efficient fine-tuning methods to further improve robustness and scalability.

Abstract

Code vulnerability detection (CVD) is essential for addressing and preventing system security issues and plays a crucial role in ensuring software security. Previous learning-based vulnerability detection methods either fine-tune medium-size sequence models or train smaller neural networks from scratch. Recent advances in large pre-trained language models (LLMs) have shown remarkable capabilities in various code intelligence tasks, including code understanding and generation. However, the effectiveness of LLMs in detecting code vulnerabilities remains largely under-explored. This work investigates this gap by fine-tuning four widely used open-source LLMs for the CVD task. We also implement five earlier graph-based or medium-size sequence models for comparison. Experiments are conducted on five commonly used CVD datasets, covering both short and long samples. In addition, we conduct quantitative experiments on the class imbalance issue and on model performance across samples of different lengths, both of which are rarely studied in previous work. To support the community, we open-source all code and resources of this study at https://github.com/SakiRinn/LLM4CVD and https://huggingface.co/datasets/xuefen/VulResource.
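
The length and class-imbalance experiments mentioned in the abstract rest on two data-preparation steps: splitting samples into short and long parts, and subsampling a fixed-size set with a chosen positive ratio. The following is a minimal sketch of those two steps, not the authors' implementation. The function names, the `{"code", "label"}` record layout, the whitespace token count, and the threshold are all assumptions made for illustration; the released framework likely uses a model-specific tokenizer and its own settings.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical record layout: {"code": <source string>, "label": 0 or 1}
Sample = Dict[str, object]


def split_by_length(samples: List[Sample], max_tokens: int) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (short, long) parts by token count.

    Whitespace splitting stands in for a real tokenizer (assumption).
    """
    short, long_ = [], []
    for s in samples:
        n_tokens = len(str(s["code"]).split())
        (short if n_tokens <= max_tokens else long_).append(s)
    return short, long_


def subsample_with_ratio(
    samples: List[Sample], total: int, positive_ratio: float, seed: int = 0
) -> List[Sample]:
    """Draw `total` samples so that a `positive_ratio` fraction is vulnerable (label 1)."""
    rng = random.Random(seed)
    pos = [s for s in samples if s["label"] == 1]
    neg = [s for s in samples if s["label"] == 0]
    n_pos = round(total * positive_ratio)
    n_neg = total - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough samples for the requested size and ratio")
    picked = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(picked)
    return picked
```

With helpers like these, a call such as `subsample_with_ratio(short_part, total=25000, positive_ratio=0.5)` would produce a balanced 25,000-sample subset, and sweeping `positive_ratio` would reproduce the kind of imbalance study the abstract describes.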

Paper Structure

This paper contains 22 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Processing procedures for sequence-based models and graph-based models. We use a simple naive tokenizer in this figure as an illustrative example.
  • Figure 2: Prompt template for large language models. {code} indicates the code content to be filled in.
  • Figure 3: Metrics on Varying Positive Sample Ratio on the DiverseVul Dataset.
  • Figure 4: Metrics on Varying Positive Sample Ratio on the Draper Dataset.
  • Figure 5: Metrics on Varying Code Sequence Length.
  • ...and 2 more figures