SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection

Xin-Cheng Wen, Cuiyun Gao, Shuzheng Gao, Yang Xiao, Michael R. Lyu

TL;DR

SCALE addresses two key gaps in vulnerability detection: capturing complex code semantics and decoding non-sequential execution paths. It introduces a Structured Natural Language Comment Tree (SCT) built by combining LLM-generated comments with ASTs, and applies a set of structured linguistic rules to encode execution sequences within the SCTs. The SCT-Enhanced Representation (SER), which fuses code and SCT embeddings through cross-attention, outperforms prior vulnerability-detection methods on three datasets and generalizes to other pre-trained models. Ablation and sensitivity analyses confirm that the SCTC and SER modules, as well as the integrated rule set, each contribute to the improvements. The approach improves detection accuracy and offers a framework adaptable to multiple code-model backbones.
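The cross-attention fusion mentioned above can be illustrated with a minimal sketch. It assumes that code-token embeddings act as queries and SCT-token embeddings act as keys and values; the summary does not say which side plays which role, and a real model would add learned projections and multiple heads. The names `cross_attention`, `code_emb`, and `sct_emb` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value rows.
        fused.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return fused


# Toy 2-dimensional embeddings (assumed values, for illustration only).
code_emb = [[1.0, 0.0], [0.0, 1.0]]             # two code tokens (queries)
sct_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three SCT tokens (keys and values)

fused = cross_attention(code_emb, sct_emb, sct_emb)
```

Each row of `fused` is a convex combination of the SCT embeddings, weighted by how closely each SCT token matches the corresponding code token.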

Abstract

Recently, there has been growing interest in automatic software vulnerability detection. Pre-trained model-based approaches have outperformed other Deep Learning (DL)-based approaches in detecting vulnerabilities. However, existing pre-trained model-based approaches generally take code sequences as input during prediction and may ignore vulnerability-related structural information, as reflected in two aspects. First, they tend to fail to infer the semantics of code statements with complex logic, such as those containing multiple operators and pointers. Second, they struggle to comprehend the various code execution sequences, which is essential for precise vulnerability detection. To mitigate these challenges, we propose SCALE, a Structured Natural Language Comment tree-based vulnerAbiLity dEtection framework built on pre-trained models. The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences based on Abstract Syntax Trees (ASTs). Specifically, SCALE comprises three main modules: (1) Comment Tree Construction, which enhances the model's ability to infer the semantics of code statements by first using Large Language Models (LLMs) to generate comments and then adding the comment nodes to ASTs; (2) Structured Natural Language Comment Tree Construction, which explicitly incorporates code execution sequences by combining code syntax templates with the comment tree; and (3) SCT-Enhanced Representation, which incorporates the constructed SCTs to better capture vulnerability patterns.
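The first two modules can be sketched on a toy tree. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `TEMPLATES` table, and the hard-coded `comments` dictionary (standing in for LLM output) are all hypothetical, and a real pipeline would parse C/C++ code into an AST with a proper parser.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                  # node type, e.g. "if_statement"
    text: str = ""                             # node value (source text or comment)
    children: list[Node] = field(default_factory=list)


# Hypothetical syntax templates that make the execution order explicit.
TEMPLATES = {
    "if_statement": "if {condition}: {body}",
    "while_statement": "while {condition}: {body}",
}


def attach_comments(node, comments):
    """Module 1 (comment tree): add a comment child to every node that has a comment."""
    for child in node.children:
        attach_comments(child, comments)
    if node.text in comments:
        node.children.append(Node("comment", comments[node.text]))


def comment_of(node):
    """Return the attached comment if there is one, else the original source text."""
    for child in node.children:
        if child.kind == "comment":
            return child.text
    return node.text


def structure(node):
    """Module 2 (SCT): render the comment tree through the syntax templates."""
    template = TEMPLATES.get(node.kind)
    if template is None:
        return comment_of(node)
    condition, body = [c for c in node.children if c.kind != "comment"]
    return template.format(condition=structure(condition), body=structure(body))


# Toy AST for the C snippet: if (!buf) return -ENOMEM;
tree = Node("if_statement", children=[
    Node("condition", "!buf"),
    Node("statement", "return -ENOMEM;"),
])

# Stand-in for LLM-generated comments, keyed by statement text (assumed values).
comments = {
    "!buf": "the buffer pointer is null",
    "return -ENOMEM;": "return an out-of-memory error",
}

attach_comments(tree, comments)
sct_text = structure(tree)
print(sct_text)  # if the buffer pointer is null: return an out-of-memory error
```

Keeping the comment as a child node rather than overwriting the statement leaves the original code in the tree, so both the source and its natural-language description remain available to later stages.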

Paper Structure

This paper contains 34 sections, 3 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: A code example with a memory leak vulnerability from the Qemu project that UniXcoder misclassifies as non-vulnerable. The red and green lines mark the code before and after the patch, respectively.
  • Figure 2: Four types of existing vulnerability detection methods.
  • Figure 3: The overview of SCALE.
  • Figure 4: An example of a structured natural language comment in SCALE. The red and grey lines denote the original source code and the structured natural language comment, respectively.
  • Figure 5: An example of the comment tree and the SCT. Black and red fonts denote the node type and value, respectively. The yellow, green, blue, red, and gray-shaded nodes denote the original nodes, comment nodes, target nodes, nodes to replace, and deleted nodes, respectively.
  • ...and 2 more figures