PatchFinder: A Two-Phase Approach to Security Patch Tracing for Disclosed Vulnerabilities in Open-Source Software

Kaixuan Li, Jian Zhang, Sen Chen, Han Liu, Yang Liu, Yixiang Chen

TL;DR

This work addresses the problem of tracing security patches for disclosed OSS vulnerabilities whose CVE records lack direct patch links. It introduces PatchFinder, a two-phase framework that first narrows the candidate commits with a hybrid lexical-semantic retriever and then re-ranks those candidates with a fine-tuned CodeReviewer encoder that learns end-to-end semantic correlations between CVE descriptions and code changes. Across 4,789 CVEs from 532 OSS projects, PatchFinder achieves a Recall@10 of $80.63\%$ and an MRR of $0.7951$ while reducing Manual Effort@10 to $2.77$, outperforming state-of-the-art baselines. In practice, 482 of the 533 patch commits it identified were confirmed by CVE Numbering Authorities (CNAs). Ablation results indicate that both phases are necessary, and the patches discovered for CVEs lacking trace links in public databases show the approach's value for OSS security maintenance and vulnerability management.
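The retrieve-then-re-rank idea can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the character-trigram scorer merely stands in for a learned semantic encoder, the Jaccard scorer stands in for the fine-tuned re-ranker, summing the lexical and semantic scores is an assumed fusion rule, and every function name and commit message below is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tokens(text: str) -> List[str]:
    """Lowercase word/identifier tokens."""
    return [t for t in re.split(r"[^a-z0-9_]+", text.lower()) if t]


def lexical_score(cve: str, commit: str) -> float:
    """Lexical matching: cosine over raw token counts."""
    return cosine(Counter(tokens(cve)), Counter(tokens(commit)))


def trigram_score(cve: str, commit: str) -> float:
    """Placeholder for a semantic encoder (assumption): character trigrams."""
    def grams(text: str) -> Counter:
        text = text.lower()
        return Counter(text[i:i + 3] for i in range(len(text) - 2))
    return cosine(grams(cve), grams(commit))


def jaccard_score(cve: str, commit: str) -> float:
    """Placeholder for a learned correlation score (assumption)."""
    a, b = set(tokens(cve)), set(tokens(commit))
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve(cve: str, commits: Dict[str, str],
             semantic_score: Scorer, top_n: int) -> List[str]:
    """Phase 1: fuse lexical and semantic scores, keep the top_n commits."""
    fused = {cid: lexical_score(cve, text) + semantic_score(cve, text)
             for cid, text in commits.items()}
    return sorted(fused, key=fused.get, reverse=True)[:top_n]


def rerank(cve: str, commits: Dict[str, str], candidates: List[str],
           correlation_score: Scorer) -> List[str]:
    """Phase 2: score only the surviving candidates with the costlier model."""
    return sorted(candidates,
                  key=lambda cid: correlation_score(cve, commits[cid]),
                  reverse=True)


# Toy data (hypothetical CVE description and commit messages).
cve = "Buffer overflow in parse_header allows remote attackers to crash the server"
commits = {
    "a": "Fix buffer overflow in parse_header by adding a bounds check",
    "b": "Update README and contributor docs",
    "c": "Refactor logging in parse_header",
}

candidates = retrieve(cve, commits, trigram_score, top_n=2)
ranked = rerank(cve, commits, candidates, jaccard_score)
print(candidates, ranked)
```

The point of the split is cost: the cheap fused score runs over every commit in the repository, while the more expensive correlation model only sees the few candidates that survive the first phase.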

Abstract

Open-source software (OSS) vulnerabilities are increasingly prevalent, which underscores the importance of security patches. However, in widely used security platforms such as NVD, a substantial number of CVE records still lack trace links to patches. Although rank-based approaches have been proposed for security patch tracing, they rely heavily on handcrafted features in a single-step framework, which limits their effectiveness. In this paper, we propose PatchFinder, a two-phase framework with end-to-end correlation learning for better tracing of security patches. In the **initial retrieval** phase, we employ a hybrid patch retriever that accounts for both lexical and semantic matching between the code changes and the description of a CVE, narrowing the search space to the candidate commits most similar to the CVE description. Afterwards, in the **re-ranking** phase, we design an end-to-end architecture under the supervised fine-tuning paradigm to learn the semantic correlations between CVE descriptions and commits. In this way, we can automatically rank the candidates by their correlation scores while maintaining low computational overhead. We evaluated our system on 4,789 CVEs from 532 OSS projects. The results are highly promising: PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. Moreover, the Manual Effort@10 required is reduced to 2.77, a 1.94 times improvement over current leading methods. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them for official review; 482 of them have been confirmed by CVE Numbering Authorities.
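For readers unfamiliar with the three reported metrics, the following sketch shows one common way to compute them from the rank at which each CVE's patch appears. It assumes a single patch commit per CVE, 1-based ranks, and `None` for a patch that was not retrieved at all. The Manual Effort@K definition used here (commits inspected before reaching the patch, capped at K) is an assumption based on common usage in patch-tracing work, not a formula quoted from this page, and the example ranks are made up.

```python
from typing import Optional, Sequence

Ranks = Sequence[Optional[int]]  # 1-based rank of the true patch, None if missing


def recall_at_k(ranks: Ranks, k: int) -> float:
    """Share of CVEs whose patch is ranked within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Ranks) -> float:
    """Mean of 1/rank; a missing patch contributes 0."""
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


def manual_effort_at_k(ranks: Ranks, k: int) -> float:
    """Assumed definition: commits an analyst reads in the top-k list,
    stopping at the patch, or all k commits if the patch is not there."""
    return sum(min(r, k) if r is not None else k for r in ranks) / len(ranks)


# Hypothetical ranks for four CVEs: found at 1 and 3, missing, found at 12.
example_ranks = [1, 3, None, 12]

print(recall_at_k(example_ranks, 10))         # 2 of 4 patches are in the top 10
print(mean_reciprocal_rank(example_ranks))    # (1 + 1/3 + 0 + 1/12) / 4
print(manual_effort_at_k(example_ranks, 10))  # (1 + 3 + 10 + 10) / 4
```

Under these definitions, a lower Manual Effort@K and a higher Recall@K and MRR all indicate that the true patch tends to sit nearer the top of the ranked list.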

Paper Structure (31 sections, 12 equations, 2 figures, 5 tables)


Figures (2)

  • Figure 1: Overview of our approach.
  • Figure 2: The workflow of our Semantic-based Retriever.