Statement-Level Vulnerability Detection: Learning Vulnerability Patterns Through Information Theory and Contrastive Learning

Van Nguyen, Trung Le, Chakkrit Tantithamthavorn, Michael Fu, John Grundy, Hung Nguyen, Seyit Camtepe, Paul Quirk, Dinh Phung

TL;DR

This work tackles statement-level vulnerability detection in large code sections where vulnerabilities are sparse within functions. It introduces LEAP, an end-to-end framework that selects vulnerability-relevant statements via a learnable Bernoulli selector and optimizes this selection by maximizing the mutual information $\mathbb{I}(\tilde{F},Y)$, where $\tilde{F}$ is the selected subset; it additionally imposes a clustered spatial contrastive learning term to capture reusable vulnerability patterns across functions. Empirical results on real-world datasets CWE-399, CWE-119, and Big-Vul show that LEAP achieves higher vulnerability coverage proportion (VCP), vulnerability coverage accuracy (VCA), and Top-10 accuracy than baselines, with improvements of about 3–14 percentage points, and benefits further from semi-supervised labeling. Ablation studies demonstrate the contributions of mutual information and CSCL, while additional experiments and auxiliary metrics highlight stability and interpretability; the authors also release code to support reproducibility and practical adoption.
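The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the learned selector network is replaced here by a hand-written list of per-statement probabilities, and the names `select_statements`, `top_k_statements`, and the example C statements are hypothetical. It shows only how a binary mask $\mathbf{z}$ picks out a subset $\tilde{F}$ of the statements in $F$, plus a deterministic top-k variant of the same idea.

```python
import random

def select_statements(statements, probs, seed=0):
    """Sample one Bernoulli variable per statement and keep those with z = 1."""
    if len(statements) != len(probs):
        raise ValueError("need one probability per statement")
    rng = random.Random(seed)
    # z[i] = 1 with probability probs[i], independently for each statement
    z = [1 if rng.random() < p else 0 for p in probs]
    selected = [s for s, keep in zip(statements, z) if keep]
    return z, selected

def top_k_statements(statements, probs, k):
    """Deterministic variant: keep the k most probable statements, in source order."""
    ranked = sorted(range(len(statements)), key=lambda i: probs[i], reverse=True)[:k]
    return [statements[i] for i in sorted(ranked)]

# Hypothetical function body and hypothetical selector outputs (assumptions).
statements = [
    "int buf[10];",
    "int idx = read_index();",
    "if (idx < 0) return -1;",
    "return buf[idx];",
]
probs = [0.1, 0.6, 0.2, 0.9]

z, selected = select_statements(statements, probs)
top2 = top_k_statements(statements, probs, k=2)
```

In the framework summarized above, the probabilities would come from a trained selector optimized so that the selected subset stays predictive of the function-level label; the sketch covers only the masking itself.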

Abstract

Software vulnerabilities are a serious and crucial concern. Typically, in a program or function consisting of hundreds or thousands of source code statements, only a few statements cause the corresponding vulnerabilities. Most current vulnerability labelling is done at the function or program level by experts with the assistance of machine learning tools. Extending this approach to the code statement level is much more costly and time-consuming and remains an open problem. In this paper, we propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function. Inspired by the specific structures observed in real-world vulnerable code, we first leverage mutual information to learn a set of latent variables representing the relevance of the source code statements to the corresponding function's vulnerability. We then propose a novel clustered spatial contrastive learning technique to further improve representation learning and make the selection of vulnerability-relevant code statements more robust. Experimental results on real-world datasets of 200k+ C/C++ functions show the superiority of our method over other state-of-the-art baselines. In general, our method scores 3% to 14% higher than the baselines on the VCP, VCA, and Top-10 ACC measures when running on real-world datasets in an unsupervised setting. Our released source code samples are publicly available at https://github.com/vannguyennd/livuitcl.

Paper Structure (33 sections, 9 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: An example of a buffer error vulnerability source code function.
  • Figure 2: The overall architecture of our LEAP method. Given a mini-batch of source code sections (i.e., from $F^{1}$ to $F^{i}$), for each code section, e.g., $F^{i}$, the selection process $\varepsilon$ learns a set of independent Bernoulli latent variables $\mathbf{z}\in\{0,1\}^{L}$ representing the relevance of the code statements to the corresponding function's vulnerability $Y^{i}$. For demonstration purposes, we assume that there are six code statements in $F^{i}$; in reality, this number can be in the hundreds. We then construct $\tilde{F}^{i}$ (i.e., the subset of code statements that actually lead to the vulnerability $Y^{i}$) by $\tilde{F}^{i}=\mathbf{z}^{i}(F^{i})\odot F^{i}$. To ensure that $\varepsilon$ obtains the most meaningful $\tilde{F}^{i}$ (i.e., one that can correctly predict the vulnerability $Y^{i}$ of $F^{i}$), we maximize the mutual information between $\tilde{F}^{i}$ and $Y^{i}$. The proposed clustered spatial contrastive learning enforces important properties of the source code data, improving representation learning for identifying and selecting vulnerable patterns and vulnerable statements in each vulnerable source code section.
  • Figure 3: An example of the improper validation of array index flaw pattern (the top-left panel) with two real-world source code functions (takeArrayValue and getValue) containing this pattern. In each function, some parts (denoted by "...") are omitted for brevity.
  • Figure 4: A graphic showing two vectors with cosine similarities close to 1, close to 0, and close to -1. The similarity of two vectors is measured by the cosine of the angle between them. The similarity can take values between -1 and +1. Smaller angles between vectors produce larger cosine values, indicating greater cosine similarity.
  • Figure 5: A demonstration of different vulnerability patterns forming different patterns in the latent space for the buffer overflow error. Pattern 1 stands for the expired pointer dereference flaw, in which the program dereferences a pointer to a memory location that was previously valid but is no longer valid. Pattern 2 represents the improper validation of array index flaw, in which the product uses untrusted input as an array index but does not validate, or incorrectly validates, the index to ensure that it references a valid position within the array. Pattern 3 represents the buffer access with incorrect length value flaw, in which the software uses a sequential operation to read or write a buffer but may use an incorrect length value, resulting in access to memory outside the bounds of the buffer. Note that each data point in a pattern is a specific $F^{top}$ (e.g., the colored background lines) of a corresponding function $F$. In this demonstration, we assume that there are three different patterns causing the buffer overflow error. In reality, the number of vulnerability patterns causing the buffer overflow error can be higher.
  • ...and 3 more figures
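
The cosine similarity described in the Figure 4 caption can be checked numerically with a minimal sketch. The helper name `cosine_similarity` and the example vectors are illustrative assumptions, not taken from the paper; the code simply computes the cosine of the angle between two vectors for the three cases the caption mentions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Illustrative vectors (assumptions) for the three cases in the caption.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction: close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # perpendicular: close to 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction: close to -1
```

Because the measure depends only on the angle, rescaling either vector leaves the value unchanged, which is why `[1, 2]` and `[2, 4]` count as maximally similar.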