Table of Contents
Fetching ...

Can Highlighting Help GitHub Maintainers Track Security Fixes?

Xueqing Liu, Yuchen Xiong, Qiushi Liu, Jiangrui Zheng

TL;DR

This work investigates using LIME (a widely used explainable machine learning method) to highlight the rationale tokens in the commit message and code and proposes an explanation method called TfIdf-Highlight, which leverages the Tf-Idf statistics to select the most informative words in the repository and the dataset.

Abstract

In recent years, the rapid growth of security vulnerabilities poses great challenges to tracing and managing them. For example, it was reported that the NVD database experienced significant delays due to the shortage of maintainers. Such delay creates challenges for third-party security personnel (e.g., administrators) to trace the information related to the CVE. To help security personnel trace a vulnerability patch, we build a retrieval system that automatically retrieves the patch in the repository. Inspired by existing work on explainable machine learning, we ask the following research question: can explanations help security maintainers make decisions in patch tracing? First, we investigate using LIME (a widely used explainable machine learning method) to highlight the rationale tokens in the commit message and code. In addition, we propose an explanation method called TfIdf-Highlight, which leverages the Tf-Idf statistics to select the most informative words in the repository and the dataset. We evaluate the effectiveness of highlighting using two experiments. First, we compare LIME and TfIdf-Highlight using a faithfulness score (i.e., sufficiency and comprehensiveness) defined for ranking. We find that TfIdf-Highlight significantly outperforms LIME's sufficiency scores by 15\% and slightly outperforms the comprehensiveness scores. Second, we conduct a blind human labeling experiment by asking the annotators to guess the patch under 3 settings (TfIdf-Highlight, LIME, and no highlight). We find that the helpfulness score for TfIdf-Highlight is higher than LIME while the labeling accuracies of LIME and TfIdf-Highlight are similar. Nevertheless, highlighting does not improve the accuracy over non-highlighting.

Can Highlighting Help GitHub Maintainers Track Security Fixes?

TL;DR

This work investigates using LIME (a widely used explainable machine learning method) to highlight the rationale tokens in the commit message and code and proposes an explanation method called TfIdf-Highlight, which leverages the Tf-Idf statistics to select the most informative words in the repository and the dataset.

Abstract

In recent years, the rapid growth of security vulnerabilities poses great challenges to tracing and managing them. For example, it was reported that the NVD database experienced significant delays due to the shortage of maintainers. Such delay creates challenges for third-party security personnel (e.g., administrators) to trace the information related to the CVE. To help security personnel trace a vulnerability patch, we build a retrieval system that automatically retrieves the patch in the repository. Inspired by existing work on explainable machine learning, we ask the following research question: can explanations help security maintainers make decisions in patch tracing? First, we investigate using LIME (a widely used explainable machine learning method) to highlight the rationale tokens in the commit message and code. In addition, we propose an explanation method called TfIdf-Highlight, which leverages the Tf-Idf statistics to select the most informative words in the repository and the dataset. We evaluate the effectiveness of highlighting using two experiments. First, we compare LIME and TfIdf-Highlight using a faithfulness score (i.e., sufficiency and comprehensiveness) defined for ranking. We find that TfIdf-Highlight significantly outperforms LIME's sufficiency scores by 15\% and slightly outperforms the comprehensiveness scores. Second, we conduct a blind human labeling experiment by asking the annotators to guess the patch under 3 settings (TfIdf-Highlight, LIME, and no highlight). We find that the helpfulness score for TfIdf-Highlight is higher than LIME while the labeling accuracies of LIME and TfIdf-Highlight are similar. Nevertheless, highlighting does not improve the accuracy over non-highlighting.

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An example of NVD's missing patch link
  • Figure 2: The proportion of available patches in NVD compared to GitHub Advisory database (average: 0.3). That is, for all the CVEs in GitHub Advisory with at least 1 patch link, we examine their corresponding NVD pages where the patch link is not missing.
  • Figure 3: The retrieval model for the multi-modal training
  • Figure 4: The trend of explainability score vs. the number of highlighted tokens. Model: CodeBERT, Fold: valid
  • Figure 5: An example of the human labeling interface