Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories

Daan Hommersom, Antonino Sabetta, Bonaventura Coppola, Dario Di Nucci, Damian A. Tamburri

TL;DR

This work addresses the challenge of scattered and incomplete vulnerability data by introducing FixFinder, a three-phase system that extracts vulnerability information from advisories, filters candidate fix commits using timing and code-change heuristics, and ranks the remaining candidates with a 23-feature ML model. On a curated dataset of 1,248 advisories and 2,391 fix commits, the approach achieves a top-10 recall of 84.03% and a top-1 recall of 65.06%, substantially reducing the manual effort required to locate fixes in OSS repositories. The study shows that time-distance features dominate predictive power, while the contribution of lexical similarity varies, and that aggressive yet careful filtering can dramatically shrink the candidate set while retaining high recall. Overall, FixFinder provides a practical, open-source baseline for automating vulnerability fix mapping, with potential for integration into SCA workflows and security tooling to improve software supply chain security.

Abstract

The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this paper, we present an approach that combines heuristics stemming from practical experience with machine learning (ML), specifically natural language processing (NLP), to address this problem. Our method consists of three phases. First, an advisory record containing key information about a vulnerability is extracted from an advisory (expressed in natural language). Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project by filtering out commits that are known to be irrelevant for the task at hand. Finally, for each such candidate commit, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. The feature vectors are then used to build a final ranked list of candidate fix commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to interpret the predictions. We evaluated our approach using a prototype implementation named FixFinder on a manually curated data set that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03% of the vulnerabilities (with a fix commit in the first position for 65.06% of the vulnerabilities). In conclusion, our method considerably reduces the effort needed to search OSS repositories for the commits that fix known vulnerabilities.
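The reported figures are top-k recall values: the share of advisories for which at least one known fix commit appears among the first k ranked candidates. The following is a minimal sketch of that metric, not FixFinder's evaluation code. The function name `top_k_recall` and the two dictionaries (`ranked`, mapping an advisory ID to its ranked commit IDs, and `known_fixes`, mapping an advisory ID to its set of known fix commits) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def top_k_recall(
    ranked: Dict[str, List[str]],
    known_fixes: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of advisories with at least one known fix in the top-k results.

    Hypothetical inputs (not taken from the paper):
      ranked      -- advisory ID -> commit IDs, best candidate first
      known_fixes -- advisory ID -> set of known fix commit IDs
    """
    if not known_fixes:
        return 0.0
    hits = 0
    for advisory_id, fixes in known_fixes.items():
        # An advisory with no ranked output counts as a miss.
        top_k = ranked.get(advisory_id, [])[:k]
        if any(commit in fixes for commit in top_k):
            hits += 1
    return hits / len(known_fixes)
```

Because an advisory counts as a hit as soon as one of its fix commits is found, the metric does not reward retrieving every commit of a multi-commit fix.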

Paper Structure

This paper contains 28 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: High-level overview of the approach
  • Figure 2: The filtering procedure: the large number of commits is reduced to a set of candidate commits by selecting a subset and then filtering irrelevant commits out of this subset.
  • Figure 3: The distance between the fix commit timestamp and the CVE publication timestamp expressed in the number of commits (top) and the number of days (bottom).
  • Figure 4: The effect of filtering commits based on their distance from the CVE publication, using a combination of the number of days and the number of commits. The y-axis corresponds to the vulnerabilities, sorted by the number of commits that fall within two years before the release date and one hundred days after it, so all dots with the same y-value correspond to the same vulnerability. The x-axis is the number of commits between the dot and the vulnerability release date. The red and green dots correspond to known fix commits; the green dots are the fix commits that fall within the selection. The number of commits in the selection, based on a combination of time and a maximum number of commits, is shown by the blue dots. For every blue dot, there is therefore at least one green or red dot at the same y-value.
  • Figure 6: The result of transforming tags for the GitHub repository Apache Cayenne to a sorted tags tree, whereby the inconsistent versioning is ignored (e.g., the usage of 'cayenne-parent' and tags with only major.minor).
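The selection described in the caption of Figure 4 (a time window around the publication date combined with a maximum number of commits) can be sketched as follows. This is an illustration under stated assumptions, not the FixFinder implementation: the window of two years before and one hundred days after comes from the caption, while the function name `select_candidates`, the default `max_commits` value, and the rule of keeping the commits closest in time to the publication date when the cap is exceeded are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# A commit is represented here as (commit ID, commit timestamp).
Commit = Tuple[str, datetime]


def select_candidates(
    commits: List[Commit],
    published: datetime,
    days_before: int = 730,   # two years before publication (from the caption)
    days_after: int = 100,    # one hundred days after publication (from the caption)
    max_commits: int = 1000,  # hypothetical cap; the actual value is not given here
) -> List[Commit]:
    """Keep commits inside the time window, capped at max_commits."""
    start = published - timedelta(days=days_before)
    end = published + timedelta(days=days_after)
    in_window = [c for c in commits if start <= c[1] <= end]
    # Assumption: when the cap applies, prefer commits nearest to publication.
    in_window.sort(key=lambda c: abs(c[1] - published))
    return in_window[:max_commits]
```

In terms of Figure 4, a known fix commit that survives this selection would be a green dot, and one that falls outside the window or beyond the cap would be a red dot.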