Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?

Yu Zhao; Lina Gong; Zhiqiu Huang; Yongwei Wang; Mingqiang Wei; Fei Wu

TL;DR

The paper systematically evaluates how code embeddings produced by ten code PTMs affect vulnerability detection performance, and finds that embedding quality does not scale monotonically with model size. It defines thirteen embedding metrics across three dimensions (statistics, norm, and distribution) to characterize embeddings, and trains a Random Forest recommender that links these metrics to embedding quality, achieving $0.91$ accuracy and $0.88$ AUC on test data. An 8000-instance code embedding dataset and a SHAP-based feature analysis indicate that metrics such as standard deviation and variance are strong predictors of embedding usefulness. In practice, the framework achieves $78\%$ consistency in selecting top-performing PTMs on unseen data, giving practitioners a concrete tool for choosing code PTMs for vulnerability detection and a starting point for future work on other code-related tasks.

Abstract

Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, reusing code pre-trained models for code embedding has become common in vulnerability detection, usually without any justification for the choice of model. The premise behind casually adopting pre-trained models (PTMs) is that the code embeddings generated by different PTMs have a similar impact on performance. Is that TRUE? To answer this question, we systematically investigate how code embeddings generated by ten different code PTMs affect the performance of vulnerability detection, and find that it is NOT true. We observe that the code embeddings generated by different code PTMs do influence performance, and that selecting an embedding technique based on parameter scale or embedding dimension is not reliable. Our findings highlight the need to quantify and evaluate the characteristics of code embeddings generated by different code PTMs in order to understand these effects. To this end, we analyze the numerical representation and data distribution of the code embeddings generated by different PTMs to evaluate their differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework that assists engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) and use them to construct a specialized code PTM recommendation dataset. We then train a Random Forest classifier as the recommendation model and use it to identify the optimal code PTMs from the candidate model zoo.
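To make the idea of embedding metrics concrete, the sketch below computes a handful of per-vector descriptors grouped by the three dimensions named in the abstract (statistics, norm, and distribution) and averages them over a set of embeddings. It is only an illustration under stated assumptions: this page names standard deviation, variance, and the L2 norm, but does not list the full set of thirteen metrics, so the remaining descriptors (mean, min, max, L1 norm, skewness) and the function names `embedding_metrics` and `dataset_metrics` are hypothetical choices, not the authors' definitions.

```python
import math
import statistics


def embedding_metrics(vec):
    """Summarize one embedding vector with a few illustrative descriptors.

    The metric set is an assumption for illustration; the source does not
    enumerate its thirteen metrics on this page.
    """
    n = len(vec)
    mean = statistics.fmean(vec)
    var = statistics.pvariance(vec, mu=mean)  # population variance
    std = math.sqrt(var)
    # Third standardized moment; defined as 0.0 for a constant vector.
    skew = (sum((x - mean) ** 3 for x in vec) / n) / std ** 3 if std > 0 else 0.0
    return {
        # statistics dimension
        "mean": mean,
        "std": std,
        "variance": var,
        "min": min(vec),
        "max": max(vec),
        # norm dimension
        "l1_norm": sum(abs(x) for x in vec),
        "l2_norm": math.sqrt(sum(x * x for x in vec)),
        # distribution dimension
        "skewness": skew,
    }


def dataset_metrics(embeddings):
    """Average each per-vector descriptor over all embeddings of one dataset."""
    rows = [embedding_metrics(v) for v in embeddings]
    return {key: statistics.fmean(r[key] for r in rows) for key in rows[0]}


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real PTM embeddings have far more dimensions.
    toy = [[3.0, 4.0], [0.5, -0.5], [1.0, 1.0]]
    for name, value in dataset_metrics(toy).items():
        print(f"{name}: {value:.4f}")
```

A feature row of this kind, computed for each (dataset, PTM) pair, is the sort of input a Random Forest recommender could be trained on; how the paper labels embedding quality for that training step is not described on this page.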

Paper Structure (16 sections, 6 figures, 4 tables)

Introduction
Related Work
Experimental Data
Code pre-trained models
Vulnerability detection datasets
Preliminary Study
Formative Study
Proposed Framework to Recommend Optimal Code Pre-trained Models
Define code embedding metrics
Construct code embedding datasets
Construct machine learning models for recommendation
Assess Recommendation Framework
Apply in Practice
Guidelines for Using Our Framework
Threats to Validity
...and 1 more section

Figures (6)

Figure 1: Performance distribution of the AUC values obtained on the corresponding test data of the classifier built based on ten different code embeddings generated by four different code PTMs on the four datasets of the vulnerability detection task.
Figure 2: Numerical distributions of the code embeddings generated by different PTMs on the vulnerability detection task. The abscissa is the value of the code embedding components, and the ordinate is the frequency of each value, shown on a logarithmic scale.
Figure 3: Data distributions of the code embeddings generated by different PTMs on the vulnerability detection task. The abscissa is the L2 norm of the embedding vectors, and the ordinate is the frequency of each value, shown on a logarithmic scale.
Figure 4: Recommendation framework.
Figure 5: Feature importance rank of the RF model.
...and 1 more figure
