Harnessing the Power of LLMs in Source Code Vulnerability Detection

Andrew A Mahyari

TL;DR

The paper addresses cross-language source code vulnerability detection by training LLMs on representations derived from LLVM IR. It uses iSeVCs as a language-agnostic intermediate form and a custom LLVM IR tokenizer whose tokens feed transformer-based encoders with a classification head, trained end-to-end. On a dataset derived from NVD and SARD, the approach reports high accuracy (e.g., BERT at ~98% vs. VulDeeLocator at ~82%), indicating that NLP-style models are effective for code vulnerability analysis. The work suggests that language-model-based static analysis over intermediate representations could enable vulnerability detection across programming languages, and it motivates further work on generalizing across vulnerability types and on dataset diversity.
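To make the tokenizer step concrete, here is a minimal sketch of how LLVM IR text could be split into tokens and mapped to integer IDs. The regular expression, the `IRVocab` class, and the `[PAD]`/`[UNK]` special tokens are illustrative assumptions; the summary does not specify the rules of the paper's tokenizer.

```python
import re

# Hypothetical token pattern for textual LLVM IR (an assumption, not the
# paper's tokenizer): identifiers, words, numbers, then single punctuation.
TOKEN_RE = re.compile(
    r"""
    [%@][\w.$-]+          # local and global identifiers such as %3 or @main
  | [A-Za-z_][\w.]*       # opcodes, types, keywords such as add, nsw, i32
  | \d+(?:\.\d+)?         # integer and decimal literals
  | [^\s\w]               # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(ir_text):
    """Split LLVM IR text into a flat list of string tokens."""
    return TOKEN_RE.findall(ir_text)


class IRVocab:
    """Maps tokens to integer IDs; 0 is padding, 1 is the unknown token."""

    def __init__(self):
        self.token_to_id = {"[PAD]": 0, "[UNK]": 1}

    def fit(self, texts):
        # Assign the next free ID to every token not seen before.
        for text in texts:
            for tok in tokenize(text):
                self.token_to_id.setdefault(tok, len(self.token_to_id))
        return self

    def encode(self, text, max_len):
        # Truncate to max_len, then right-pad with the padding ID.
        ids = [self.token_to_id.get(t, 1) for t in tokenize(text)][:max_len]
        return ids + [0] * (max_len - len(ids))


line = "%3 = add nsw i32 %1, %2"
vocab = IRVocab().fit([line])
ids = vocab.encode(line, max_len=10)
```

The resulting fixed-length ID sequences are the kind of input a transformer encoder expects; a production tokenizer would likely normalize register names and handle string constants, which this sketch does not attempt.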

Abstract

Software vulnerabilities, caused by unintentional flaws in source code, are a primary root cause of cyberattacks. Static analysis of source code has been widely used to detect these defects. Large Language Models (LLMs) have demonstrated human-like conversational abilities due to their capacity to capture complex patterns in sequential data, such as natural language. In this paper, we harness LLMs' capabilities to analyze source code and detect known vulnerabilities. To make the proposed vulnerability detection method applicable across multiple programming languages, we convert source code to LLVM IR and train LLMs on these intermediate representations. We conduct extensive experiments on various LLM architectures and compare their accuracy. Our experiments on real-world and synthetic code from NVD and SARD demonstrate high accuracy in identifying source code vulnerabilities.
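As an illustration of the source-to-IR step, the sketch below builds a `clang` invocation that emits textual LLVM IR for a C file. The `-S -emit-llvm` flags are standard clang options; the helper names, the `-O0` choice, and the output naming are assumptions for illustration, and the summary does not state which toolchain settings the paper used.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def emit_llvm_ir_command(source: str, clang: str = "clang") -> List[str]:
    """Build the command that writes textual LLVM IR (.ll) next to the source."""
    src = Path(source)
    out = src.with_suffix(".ll")
    # -S -emit-llvm asks clang for human-readable IR instead of an object file.
    # -O0 (an assumption here) keeps the IR close to the original source.
    return [clang, "-S", "-emit-llvm", "-O0", str(src), "-o", str(out)]


def compile_to_ir(source: str) -> Optional[Path]:
    """Run clang if it is installed; return the .ll path, or None if it is not."""
    if shutil.which("clang") is None:
        return None
    cmd = emit_llvm_ir_command(source)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


cmd = emit_llvm_ir_command("demo.c")
```

Other LLVM-based front ends can produce the same IR format, which is what makes the representation a candidate for language-agnostic analysis.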

Paper Structure (9 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: The overall architecture of the proposed vulnerability detection algorithm. First, source code is converted to LLVM IR; the LLVM IR is then converted to iSeVCs, and the tokenizer maps these to unique token IDs. The output of the LLM is used to predict whether the whole source code is vulnerable.
  • Figure 2: Accuracy vs. number of fully connected (FC) layers in the classifier head: (a) BERT; (b) DistilBERT.
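The classifier head varied in Figure 2 can be pictured as a small stack of FC layers on top of the encoder's output vector. The following dependency-free sketch builds such a head with a configurable number of layers and runs a forward pass. The hidden width, ReLU activations, sigmoid output, and random initialization are assumptions made for illustration; the weights are untrained, so the output is not a meaningful prediction.

```python
import math
import random


def make_head(in_dim, hidden_dim, num_fc_layers, seed=0):
    """Create num_fc_layers FC layers ending in a single output logit."""
    rng = random.Random(seed)
    dims = [in_dim] + [hidden_dim] * (num_fc_layers - 1) + [1]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = 1.0 / math.sqrt(d_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(d_in)]
                   for _ in range(d_out)]
        biases = [0.0] * d_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Return a score in (0, 1) for an encoder output vector x."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU between hidden layers
    return 1.0 / (1.0 + math.exp(-x[0]))  # sigmoid on the final logit


# Stand-in for a pooled encoder output; a real one would come from the LLM.
embedding = [0.1 * i for i in range(8)]
scores = {n: forward(make_head(8, 4, n), embedding) for n in (1, 2, 3)}
```

Sweeping `num_fc_layers` in this way mirrors the kind of comparison the figure reports, with accuracy measured after training each configuration.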