VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model

Tianyu Chen, Lin Li, Liuchuan Zhu, Zongyang Li, Xueqing Liu, Guangtai Liang, Qianxiang Wang, Tao Xie

TL;DR

Vulnerability reports require accurate extraction of affected package names to support defense; prior methods rely on ranking with small models and struggle across large ecosystems. The authors introduce VulLibGen, a generation-based framework that leverages large language models augmented with supervised fine-tuning, retrieval-augmented generation, and a novel local search postprocessing step to reduce hallucinations. Across four ecosystems (Java, JS, Python, Go), VulLibGen achieves an average Accuracy@1 of 0.806, a statistically significant improvement (p < 1e-5) over the best prior method at 0.721. To demonstrate practical impact, the authors submitted 60 <vulnerability, affected package> pairs to GitHub Advisory, of which 34 were accepted and merged and 20 are pending; code and data are released.

Abstract

Security practitioners maintain vulnerability reports (e.g., GitHub Advisory) to help developers mitigate security risks. An important task for these databases is automatically extracting structured information mentioned in the report, e.g., the affected software packages, to accelerate the defense of the vulnerability ecosystem. However, it is challenging for existing work on affected package identification to achieve high accuracy. One reason is that all existing work focuses on relatively small models and thus cannot harness the knowledge and semantic capabilities of large language models (LLMs). To address this limitation, we propose VulLibGen, the first method to use an LLM for affected package identification. In contrast to existing work, VulLibGen takes the novel approach of directly generating the name of the affected package. To improve accuracy, VulLibGen employs supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and a local search algorithm. The local search algorithm is a novel postprocessing algorithm that we introduce to reduce hallucination in the generated package names. Our evaluation results show that VulLibGen achieves an average accuracy of 0.806 for identifying vulnerable packages in the four most popular ecosystems in GitHub Advisory (Java, JS, Python, Go), while the best average accuracy in previous work is 0.721. Additionally, VulLibGen has high value to security practice: we submitted 60 <vulnerability, affected package> pairs, covering four ecosystems, to GitHub Advisory; 34 of them have been accepted and merged, and 20 are pending approval. Our code and dataset can be found in the attachments.
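The abstract does not spell out how the local search step works, so the following is only a minimal sketch of the general idea of postprocessing a generated name against a list of real packages. It assumes a plain nearest-name match by Levenshtein edit distance, which is a stand-in technique and not necessarily the paper's algorithm. The function names `edit_distance` and `snap_to_known_package` and the example package list are hypothetical and chosen for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def snap_to_known_package(generated: str, known_packages: Sequence[str]) -> str:
    """Map a generated name to an existing package name.

    Assumption: the closest name by edit distance is a reasonable repair for a
    hallucinated name. Ties are broken alphabetically for determinism.
    """
    if not known_packages:
        raise ValueError("known_packages must not be empty")
    if generated in known_packages:
        return generated  # already a real package, nothing to repair
    return min(known_packages, key=lambda name: (edit_distance(generated, name), name))


# Illustrative package list; a real ecosystem index would be far larger.
known = [
    "org.apache.logging.log4j:log4j-core",
    "org.apache.logging.log4j:log4j-api",
    "com.fasterxml.jackson.core:jackson-databind",
]
repaired = snap_to_known_package("org.apache.logging.log4j:log4j-cores", known)
print(repaired)
```

A linear scan like this costs one distance computation per known package, so for ecosystems with very many packages a practical system would likely restrict the candidate set first, for example to the packages returned by the retrieval step.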

Paper Structure (20 sections, 3 figures, 14 tables, 1 algorithm)

This paper contains 20 sections, 3 figures, 14 tables, 1 algorithm.

Figures (3)

  • Figure 1: GitHub Advisory Report for https://github.com/advisories/GHSA-485q-v457-3p58
  • Figure 2: The VulLibGen Framework
  • Figure 3: Trade-Offs between Efficiency and Accuracy