Evaluating LLaMA 3.2 for Software Vulnerability Detection
José Gonçalves, Miguel Silva, Bernardo Cabral, Tiago Dias, Eva Maia, Isabel Praça, Ricardo Severino, Luís Lino Ferreira
TL;DR
This work addresses the challenge of vulnerability detection in real-world C/C++ code by refining the DiverseVul dataset with the SCoPE preprocessing framework and evaluating a lightweight LLM, LLaMA 3.2, fine-tuned via Low-Rank Adaptation (LoRA) for binary classification. The authors demonstrate that data preprocessing reduces token length and helps the model focus on code semantics rather than programmer-chosen identifiers, yielding an F1-Score of 66%, a competitive result compared to a strong baseline of 47%. Key contributions include the public release of a cleaned DiverseVul variant, an analysis of the impact of preprocessing on software vulnerability detection (SVD), and evidence that small models trained with parameter-efficient fine-tuning (PEFT) can effectively perform vulnerability detection on real-world code. The findings have practical implications for deploying SVD on lower-end hardware and for guiding future work in dataset curation, robustness, and prompt-engineering strategies for code-security tasks.
Abstract
Deep Learning (DL) has emerged as a powerful tool for vulnerability detection, often outperforming traditional solutions. However, developing effective DL models requires large amounts of real-world data, which can be difficult to obtain. To address this challenge, the DiverseVul dataset was curated as the largest dataset of vulnerable and non-vulnerable C/C++ functions extracted exclusively from real-world projects. Its goal is to provide high-quality, large-scale samples for training DL models. However, while applying pre-processing techniques during our study, we identified several inconsistencies in the raw dataset, highlighting the need for a refined version. In this work, we present a refined version of the DiverseVul dataset, which is used to fine-tune a large language model, LLaMA 3.2, for vulnerability detection. Experimental results show that the use of pre-processing techniques led to an improvement in performance, with the model achieving an F1-Score of 66%, a competitive result when compared to our baseline, which achieved a 47% F1-Score in software vulnerability detection.
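The summary above notes that preprocessing shortens inputs and steers the model away from programmer-chosen identifiers. As a rough illustration of that idea only, the sketch below replaces identifiers in a C snippet with generic placeholders. It is not the SCoPE implementation: the function name `normalize_identifiers`, the `FUNC_n`/`VAR_n` placeholder scheme, and the short keyword list are assumptions made for this example. It also ignores string literals and comments and renames library calls such as `strcpy`, which a real pipeline would likely treat differently.

```python
import re

# Assumption: a small, non-exhaustive set of C/C++ keywords and type names
# that should be left untouched. A real tool would use a full list or a parser.
C_KEYWORDS = {
    "int", "char", "void", "long", "short", "unsigned", "signed", "float",
    "double", "struct", "union", "enum", "const", "static", "return", "if",
    "else", "for", "while", "do", "switch", "case", "break", "continue",
    "sizeof", "typedef", "goto", "default",
}

# Identifiers: a letter or underscore followed by word characters.
# The leading \b keeps hex literals such as 0x1F from being matched.
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")
# Used to check whether an identifier is immediately followed by "(".
CALL = re.compile(r"\s*\(")


def normalize_identifiers(code: str) -> str:
    """Replace each non-keyword identifier with a generic placeholder.

    An identifier followed by "(" at its first occurrence is treated as a
    function name (FUNC_n); every other identifier becomes VAR_n. The same
    original name always maps to the same placeholder.
    """
    mapping = {}
    counters = {"FUNC": 0, "VAR": 0}

    def rename(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            kind = "FUNC" if CALL.match(code, match.end()) else "VAR"
            counters[kind] += 1
            mapping[name] = f"{kind}_{counters[kind]}"
        return mapping[name]

    return IDENT.sub(rename, code)


if __name__ == "__main__":
    sample = "int copy_buf(char *dst, char *src) { strcpy(dst, src); return 0; }"
    print(normalize_identifiers(sample))
    # int FUNC_1(char *VAR_1, char *VAR_2) { FUNC_2(VAR_1, VAR_2); return 0; }
```

A regex-based pass keeps the sketch self-contained. A parser-based approach would be more robust, since it can distinguish declarations, calls, macros, and literals.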
