A Practical Approach to the Automatic Classification of Security-Relevant Commits

Antonino Sabetta, Michele Bezzi

TL;DR

This paper tackles the scarcity and inconsistency of vulnerability data for open-source components by predicting security-relevant commits directly from repository data. By treating the code changes in a commit (the patch) and its log message as text documents, the authors train two classifiers—a log-message model and a patch-content model—and fuse them with a simple voting strategy to maximize precision while preserving recall. Using a SAP-internal Java dataset of 2715 labeled commits, they demonstrate that the joint model achieves precision around $0.80$ with recall around $0.43$, and that a cross-language test on Ruby data yields comparable gains ($0.82$ precision, $0.61$ recall). The results indicate that this natural-language approach enables timely, scalable vulnerability management without relying on external advisories, with potential for improvements via embeddings or deeper neural models.
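
The fusion step can be illustrated with a small sketch. The exact voting rule is not spelled out here, so the code below assumes a unanimous vote: a commit is flagged only when both the log-message model and the patch-content model score it above a threshold, a rule that tends to trade recall for precision. The scores, labels, threshold, and function names are hypothetical and are not taken from the paper's dataset.

```python
from typing import List, Tuple


def unanimous_vote(msg_score: float, patch_score: float, threshold: float = 0.5) -> bool:
    """Flag a commit only if BOTH classifiers score it at or above the threshold.

    Assumption: this is one possible voting rule, not necessarily the authors'.
    """
    return msg_score >= threshold and patch_score >= threshold


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (security-relevant) class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical (message score, patch score, is_security_relevant) triples.
SAMPLE = [
    (0.9, 0.8, True),
    (0.7, 0.3, True),
    (0.6, 0.7, False),
    (0.2, 0.9, False),
    (0.8, 0.9, True),
    (0.1, 0.2, False),
]

predictions = [unanimous_vote(m, p) for m, p, _ in SAMPLE]
labels = [y for _, _, y in SAMPLE]
precision, recall = precision_recall(predictions, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold or requiring agreement makes the joint model more conservative; relaxing the rule to "either model fires" would push the balance toward recall instead.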

Abstract

The lack of reliable sources of detailed information on the vulnerabilities of open-source software (OSS) components is a major obstacle to maintaining a secure software supply chain and an effective vulnerability management process. Standard sources of advisories and vulnerability data, such as the National Vulnerability Database (NVD), are known to suffer from poor coverage and inconsistent quality. To reduce our dependency on these sources, we propose an approach that uses machine learning to analyze source code repositories and to automatically identify commits that are security-relevant (i.e., that are likely to fix a vulnerability). We treat the source code changes introduced by commits as documents written in natural language, classifying them using standard document classification methods. Combining independent classifiers that use information from different facets of commits, our method can yield high precision (80%) while ensuring acceptable recall (43%). In particular, the use of information extracted from the source code changes yields a substantial improvement over the best known approach in the state of the art, while requiring a significantly smaller amount of training data and employing a simpler architecture.
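
To make the idea of treating a patch as a natural-language document concrete, the following is a minimal sketch of one way to turn a unified diff into a bag of words. The tokenization rules (keeping only added and removed lines, splitting identifiers on camelCase and underscores) and the sample diff are assumptions chosen for illustration; the preprocessing actually used by the authors is not described in this section.

```python
import re
from collections import Counter
from typing import List

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Splits identifiers such as "escapeSql" or "XMLParser" into word-like parts.
WORD_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def patch_to_tokens(diff_text: str) -> List[str]:
    """Turn the changed lines of a unified diff into lowercase word tokens."""
    tokens: List[str] = []
    for line in diff_text.splitlines():
        # Skip file headers ("--- a/...", "+++ b/...").
        if line.startswith(("+++", "---")):
            continue
        # Keep only added or removed lines; hunk headers and context are ignored.
        if line.startswith(("+", "-")):
            for ident in IDENTIFIER.findall(line[1:]):
                tokens.extend(part.lower() for part in WORD_PART.findall(ident))
    return tokens


# Hypothetical diff, written line by line to keep the leading +/- markers intact.
SAMPLE_DIFF = "\n".join([
    "--- a/src/Login.java",
    "+++ b/src/Login.java",
    "@@ -10,2 +10,2 @@",
    '-    String q = "SELECT * FROM users WHERE name=" + userName;',
    "+    String q = escapeSql(userName);",
])

bag_of_words = Counter(patch_to_tokens(SAMPLE_DIFF))
print(bag_of_words.most_common(5))
```

A term-count vector of this kind is the sort of input a standard document classifier can consume, in the same way it would consume the words of a commit log message.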

Paper Structure

This paper contains 8 sections, 1 figure.

Figures (1)

  • Figure 1: Evaluation results