Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

Kohei Dozono; Tiago Espinha Gasiba; Andrea Stocco

TL;DR

This work systematically evaluates six state-of-the-art LLMs for vulnerability detection and CWE classification across five programming languages, using a curated multi-language dataset and a range of prompting strategies. GPT-4 Turbo and GPT-4o generally provide the best detection and classification performance, and few-shot prompting further improves CWE classification. To bridge research and practice, the authors implement CodeGuardian, a VSCode extension for real-time LLM-assisted vulnerability analysis, and validate it in a user study with industry practitioners that shows substantial gains in accuracy and speed. The study finds that performance depends on both the language and the vulnerability type, that CWE classification is harder than detection, and that further dataset enrichment and new prompting strategies are the next steps toward secure software engineering in real-world workflows.

Abstract

Most vulnerability detection studies focus on datasets of vulnerabilities in C/C++ code, offering limited language diversity. Thus, the effectiveness of deep learning methods, including large language models (LLMs), in detecting software vulnerabilities beyond these languages is still largely unexplored. In this paper, we evaluate the effectiveness of LLMs in detecting and classifying Common Weakness Enumerations (CWE) using different prompt and role strategies. Our experimental study targets six state-of-the-art pre-trained LLMs (GPT-3.5-Turbo, GPT-4 Turbo, GPT-4o, CodeLLama-7B, CodeLLama-13B, and Gemini 1.5 Pro) and five programming languages: Python, C, C++, Java, and JavaScript. We compiled a multi-language vulnerability dataset from different sources to ensure representativeness. Our results show that GPT-4o achieves the highest vulnerability detection and CWE classification scores in a few-shot setting. Beyond the quantitative results of our study, we developed a library called CODEGUARDIAN, integrated with VSCode, which enables developers to perform LLM-assisted real-time vulnerability analysis in real-world security scenarios. We evaluated CODEGUARDIAN in a user study involving 22 developers from industry. The study showed that developers using CODEGUARDIAN are more accurate and faster at detecting vulnerabilities.
