Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks
Zhongxin Liu, Zhijie Tang, Junwei Zhang, Xin Xia, Xiaohu Yang
TL;DR
This paper improves vulnerability analysis by enriching pre-training with program-dependence knowledge. It introduces two pre-training objectives, Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which teach a Transformer model (PDBERT) to predict statement-level control dependencies and token-level data dependencies directly from source code, so the model can also handle partial code. PDBERT, initialized from CodeBERT and pre-trained on about 1.9M C/C++ functions, achieves state-of-the-art results on vulnerability detection, classification, and assessment. In intrinsic evaluation, its F1-scores are over 99% for control dependencies and over 94% for data dependencies, and it runs about 23 times faster than Joern. Ablation shows that both CDP and DDP contribute to the improvements over strong baselines, suggesting that dependence knowledge learned end to end from source code generalizes well to downstream tasks. The approach is practical for analyzing vulnerable code snippets that cannot be fully parsed, and it complements static-analysis tools in high-throughput settings.
Abstract
Vulnerability analysis is crucial for software security. This work focuses on using pre-training techniques to enhance the understanding of vulnerable code and boost vulnerability analysis. The code understanding ability of a pre-trained model is highly related to its pre-training objectives. The semantic structure of code, e.g., control and data dependencies, is important for vulnerability analysis. However, existing pre-training objectives either ignore such structure or focus on learning to use it. The feasibility and benefits of learning the knowledge of analyzing semantic structure have not been investigated. To this end, this work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet based only on its source code. During pre-training, CDP and DDP can guide the model to learn the knowledge required for analyzing fine-grained dependencies in code. After pre-training, the pre-trained model can boost the understanding of vulnerable code during fine-tuning and can be used directly to perform dependence analysis for both partial and complete functions. To demonstrate the benefits of our pre-training objectives, we pre-train a Transformer model named PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks, i.e., vulnerability detection, vulnerability classification, and vulnerability assessment, and also evaluate it on program dependence analysis. Experimental results show that PDBERT benefits from CDP and DDP, leading to state-of-the-art performance on the three downstream tasks. Also, PDBERT achieves F1-scores of over 99% and 94% for predicting control and data dependencies, respectively, in partial and complete functions.
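
To make the two objectives more concrete, the following minimal Python sketch shows one way CDP and DDP could be framed as pairwise binary prediction: statement-level control dependencies and token-level data dependencies become 0/1 label matrices, a binary cross-entropy loss compares model logits against them, and F1 is computed over predicted edges. This is an illustration under stated assumptions, not PDBERT's implementation: the toy C snippet, its hand-written dependency edges, the whitespace tokenization, and the helper names `dependency_labels`, `bce_loss`, and `edge_f1` are all hypothetical, and the abstract does not specify the model's prediction heads or loss function.

```python
import math

def dependency_labels(n, edges):
    """Return an n x n 0/1 matrix; labels[i][j] == 1 means unit j depends on unit i."""
    labels = [[0] * n for _ in range(n)]
    for src, dst in edges:
        labels[src][dst] = 1
    return labels

def bce_loss(logits, labels):
    """Mean binary cross-entropy over all (i, j) pairs; logits are raw scores."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            p = 1.0 / (1.0 + math.exp(-z))        # sigmoid
            p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep log() finite
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count

def edge_f1(predicted, gold):
    """F1-score of a predicted edge set against a gold edge set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy C snippet (assumption: edges below are hand-written, not tool output).
statements = ["int n = read();", "if (n > 0)", "buf[n] = 0;", "return n;"]
# Statement 2 executes only if the condition in statement 1 holds.
control_edges = [(1, 2)]

# Naive whitespace tokenization, used only to obtain token indices.
tokens = "int n = read ( ) ; if ( n > 0 ) buf [ n ] = 0 ; return n ;".split()
# The definition of n (token 1) reaches its uses at tokens 9, 15, and 21.
data_edges = [(1, 9), (1, 15), (1, 21)]

control_labels = dependency_labels(len(statements), control_edges)  # CDP targets
data_labels = dependency_labels(len(tokens), data_edges)            # DDP targets

# An untrained scorer that outputs logit 0 for every statement pair.
untrained_logits = [[0.0] * len(statements) for _ in statements]
cdp_loss = bce_loss(untrained_logits, control_labels)

# Score a hypothetical prediction that gets one data edge wrong.
ddp_f1 = edge_f1([(1, 9), (1, 15), (1, 20)], data_edges)
```

In a real setup the logits would come from the pre-trained encoder's pairwise scores over statement or token representations, and the gold edges would come from a static analyzer such as Joern run on complete functions. The sketch only fixes the shape of the targets and the metric.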
