VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

Arastoo Zibaeirad, Marco Vieira

TL;DR

VulnLLMEval presents a comprehensive benchmark and automated dataset for evaluating large language models on software vulnerability detection (SVD) and patching (SVP) using real-world Linux kernel vulnerabilities. The framework automatically harvests vulnerable and patched code blocks via commit histories, enabling robust, reproducible evaluation of 10 LLMs across 8 tasks with metrics such as MRR, Top-5, ROUGE-L, CodeBLEU, and cyclomatic complexity. Results reveal significant model-to-model variability in vulnerability ranking and detection, with models often oversimplifying patches and struggling to distinguish vulnerable from patched code, though performance improves with larger context windows. The work provides actionable directions for improvement, including vulnerability-focused fine-tuning, memory-enabled learning, and better prompting to achieve more reliable SVD/SVP in practice.
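To make the ranking metrics named above concrete, the following is a minimal sketch of how MRR and Top-5 accuracy are conventionally computed. It is not the framework's own code: the function names, the example label strings, and the assumption that each model response is an ordered list of candidate labels are all illustrative.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], truth: str) -> float:
    """Return 1/rank of the correct label, or 0.0 if it was never predicted."""
    for position, label in enumerate(ranked, start=1):
        if label == truth:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    predictions: Sequence[Sequence[str]], truths: Sequence[str]
) -> float:
    """Average the reciprocal rank over all samples."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    total = sum(reciprocal_rank(p, t) for p, t in zip(predictions, truths))
    return total / len(truths)


def top_k_accuracy(
    predictions: Sequence[Sequence[str]], truths: Sequence[str], k: int = 5
) -> float:
    """Fraction of samples whose correct label appears in the first k guesses."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(1 for p, t in zip(predictions, truths) if t in p[:k])
    return hits / len(truths)


# Hypothetical ranked predictions for three code blocks (labels are made up
# for illustration and are not taken from the paper's dataset).
predictions = [
    ["CWE-416", "CWE-787"],            # correct label ranked first
    ["CWE-125", "CWE-20", "CWE-476"],  # correct label ranked third
    ["CWE-119"],                       # correct label missing
]
truths = ["CWE-416", "CWE-476", "CWE-362"]

mrr = mean_reciprocal_rank(predictions, truths)
top5 = top_k_accuracy(predictions, truths, k=5)
print(f"MRR = {mrr:.3f}, Top-5 = {top5:.3f}")
```

MRR rewards placing the correct label near the top of the list, whereas Top-5 only checks whether it appears among the first five guesses, so the two can diverge for the same model.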

Abstract

Large Language Models (LLMs) have shown promise in tasks like code translation, prompting interest in their potential for automating software vulnerability detection (SVD) and patching (SVP). To further research in this area, establishing a benchmark is essential for evaluating the strengths and limitations of LLMs in these tasks. Despite their capabilities, questions remain regarding whether LLMs can accurately analyze complex vulnerabilities and generate appropriate patches. This paper introduces VulnLLMEval, a framework designed to assess the performance of LLMs in identifying and patching vulnerabilities in C code. Our study includes 307 real-world vulnerabilities extracted from the Linux kernel, creating a well-curated dataset that includes both vulnerable and patched code. This dataset, based on real-world code, provides a diverse and representative testbed for evaluating LLM performance in SVD and SVP tasks, offering a robust foundation for rigorous assessment. Our results reveal that LLMs often struggle with distinguishing between vulnerable and patched code. Furthermore, in SVP tasks, these models tend to oversimplify the code, producing solutions that may not be directly usable without further refinement.

Paper Structure

This paper contains 23 sections, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: VulnLLMEval Architecture
  • Figure 2: Accuracy of LLMs in Identifying Vulnerable and Patched Code Blocks Across SVD3, SVD4, SVD5, and SVD6 Prompts.
  • Figure 3: Complexity Scores
  • Figure 4: Similarity Scores (ROUGE Scores, CodeBLEU Scores)
  • Figure 5: Accuracy of LLMs Based on the Number of Lines in Vulnerable Code Blocks
  • ...and 1 more figure