Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues

Daniele Cipollone, Changjie Wang, Mariano Scazzariello, Simone Ferlin, Maliheh Izadi, Dejan Kostic, Marco Chiesa

TL;DR

This work addresses early detection of software vulnerabilities by analyzing GitHub issues using transformer-based methods. It introduces a CVE-linked dataset of 4,379 GitHub issues and evaluates three approaches: embedding-based classifiers, LLM-based detection, and a combined pipeline that integrates both. The combined approach achieves the best performance, roughly doubling the vulnerability detection F1 compared to a baseline, while also generating informative vulnerability descriptions. The study demonstrates the practicality of scalable, cost-efficient vulnerability detection for open-source ecosystems, enabling earlier mitigation before official disclosures.

Abstract

In today's digital landscape, the importance of timely and accurate vulnerability detection has significantly increased. This paper presents a novel approach that leverages transformer-based models and machine learning techniques to automate the identification of software vulnerabilities by analyzing GitHub issues. We introduce a new dataset specifically designed for classifying GitHub issues relevant to vulnerability detection. We then examine various classification techniques to determine their effectiveness. The results demonstrate the potential of this approach for real-world application in early vulnerability detection, which could substantially reduce the window of exploitation for software vulnerabilities. This research makes a key contribution to the field by providing a scalable and computationally efficient framework for automated detection, enabling the prevention of compromised software usage before official notifications. This work has the potential to enhance the security of open-source software ecosystems.
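
The F1 score quoted in the TL;DR is the harmonic mean of precision and recall on the positive class (here, issues relevant to a vulnerability). The following is a minimal sketch of that computation for binary labels. The function name `f1_score` and the label lists are illustrative assumptions, not data or code from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = vulnerability-relevant issue, 0 = not relevant.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 ignores true negatives, it is a common choice when the relevant class is rare, which is a plausible reason to report it for vulnerability detection.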

Paper Structure (12 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Embedding-based XGBoost classifier.
  • Figure 2: Template of the user prompt used in the pipelines.
  • Figure 3: t-SNE plot of the dataset showing the distribution of issues. Relevant issues are marked in blue, while non-relevant ones are in orange.
  • Figure 4: The combined pipeline using LLM and XGBoost.
  • Figure 5: Sensitivity analysis. The vertical dashed red line represents the ratio used to train the XGBoost classifier.
  • ...and 1 more figure