RiskHarvester: A Risk-based Tool to Prioritize Secret Removal Efforts in Software Artifacts

Setu Kumar Basak, Tanmay Pardeshi, Bradley Reaves, Laurie Williams

TL;DR

This work addresses the prioritization of secret removal in software artifacts by introducing RiskHarvester, a risk-based tool that assigns each secret-asset pair a security risk score, computed as the product of asset value and ease of attack. Asset value is inferred from database keywords in the source code, which are mapped to sensitive data categories; ease of attack is derived from passive network information about the asset's host. The approach is evaluated on RiskBench, a benchmark of 1,791 secret-asset pairs, and through a developer survey on whether risk scores influence prioritization. RiskHarvester achieves high precision and recall in identifying database keywords (95% and 90%) and valid hosts (96% and 94%), and 86% of the 52 surveyed developers prioritized secrets for removal in descending order of risk score.
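The scoring rule described above can be illustrated with a minimal sketch. The class name, field names, and numeric scales below are assumptions made for illustration; the summary states only that the risk score is the product of asset value and ease of attack, not how either factor is scaled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretAssetPair:
    """A detected secret and the asset it protects (hypothetical structure)."""
    secret_id: str         # hypothetical identifier for the detected secret
    asset_value: float     # assumed numeric score for the asset's data sensitivity
    ease_of_attack: float  # assumed numeric score for how reachable the host is

    @property
    def risk_score(self) -> float:
        # Risk score as the product of the two factors, as stated in the summary.
        return self.asset_value * self.ease_of_attack


def prioritize(pairs):
    """Return pairs ordered by descending risk score (highest risk first)."""
    return sorted(pairs, key=lambda p: p.risk_score, reverse=True)


# Illustrative values only; these are not taken from the paper.
pairs = [
    SecretAssetPair("db-legacy", asset_value=1.0, ease_of_attack=3.0),
    SecretAssetPair("db-medical", asset_value=5.0, ease_of_attack=2.0),
    SecretAssetPair("db-internal", asset_value=4.0, ease_of_attack=1.0),
]
ranked = prioritize(pairs)
for pair in ranked:
    print(pair.secret_id, pair.risk_score)
```

Sorting in descending order places the secret guarding the most valuable and most reachable asset at the top of the removal queue, which matches the ordering the surveyed developers were asked about.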

Abstract

Since 2020, GitGuardian has been detecting hard-coded secrets checked into GitHub repositories. Between 2020 and 2023, GitGuardian observed an upward annual trend and a four-fold increase in hard-coded secrets, with 12.8 million exposed in 2023. However, removing all secrets from software artifacts is not feasible due to time constraints and technical challenges. Moreover, secrets do not carry equal security risk: the assets they protect range from obsolete databases to sensitive medical data. Thus, secret removal should be prioritized by security risk reduction, which existing secret detection tools do not support. The goal of this research is to aid software practitioners in prioritizing secret removal efforts through our security risk-based tool. We present RiskHarvester, a risk-based tool that computes a security risk score based on the value of the asset and the ease of attack on a database. We calculated the value of the asset by identifying the sensitive data categories present in a database from the database keywords in the source code. We used data flow analysis together with SQL and ORM parsing to identify the database keywords. To calculate the ease of attack, we used passive network analysis to retrieve the database host information. To evaluate RiskHarvester, we curated RiskBench, a benchmark of 1,791 database secret-asset pairs with sensitive data categories and host information manually retrieved from 188 GitHub repositories. RiskHarvester achieves a precision of 95% and a recall of 90% in detecting database keywords for the value of the asset, and a precision of 96% and a recall of 94% in detecting valid hosts for ease of attack. Finally, we conducted a survey (52 respondents) to understand whether developers prioritize secret removal based on security risk score. We found that 86% of the developers prioritized the secrets for removal in descending order of security risk score.
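To make the idea of inferring sensitive data categories from database keywords concrete, here is a rough sketch. It is not the paper's method: it substitutes simple regular expressions for the data flow analysis and SQL/ORM parsing the abstract describes, and the category map, function names, and sample query are all hypothetical.

```python
import re

# Hypothetical keyword-to-category map; the paper's actual taxonomy is not given here.
CATEGORY_KEYWORDS = {
    "medical": {"patient", "diagnosis", "prescription"},
    "financial": {"salary", "card", "invoice"},
    "contact": {"email", "phone", "address"},
}

# Table names that follow common SQL clauses (a simplification of real SQL parsing).
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN|INTO|UPDATE)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)
# Column list between SELECT and FROM.
SELECT_PATTERN = re.compile(r"\bSELECT\s+(.*?)\s+FROM\b", re.IGNORECASE | re.DOTALL)


def extract_keywords(sql: str) -> set:
    """Collect lower-cased table and column names from one SQL string."""
    tables = {name.lower() for name in TABLE_PATTERN.findall(sql)}
    columns = set()
    for column_list in SELECT_PATTERN.findall(sql):
        for column in column_list.split(","):
            # Drop a table qualifier such as "patients." and normalize case.
            name = column.strip().split(".")[-1].lower()
            if re.fullmatch(r"[a-z_][a-z0-9_]*", name):
                columns.add(name)
    return tables | columns


def categorize(keywords: set) -> set:
    """Map keywords to categories by matching underscore-separated tokens."""
    found = set()
    for keyword in keywords:
        tokens = set(keyword.split("_"))
        for category, terms in CATEGORY_KEYWORDS.items():
            if tokens & terms:
                found.add(category)
    return found


query = (
    "SELECT patient_name, email FROM patients "
    "JOIN invoices ON patients.id = invoices.patient_id"
)
keywords = extract_keywords(query)
categories = categorize(keywords)
print(sorted(keywords), sorted(categories))
```

Exact token matching keeps the sketch predictable but misses plural table names such as `patients` and `invoices`; a real implementation would need stemming or a richer vocabulary, plus handling for queries built up across variables, which is where data flow analysis comes in.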

Paper Structure

This paper contains 19 sections, 1 equation, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Asset's value can be inferred from the database name, table names, and column names from the source code.
  • Figure 2: We identified three patterns to locate database, table, and column names for each secret-asset pair in the source code.
  • Figure 3: A flow diagram for assigning ease of attack category for an asset identified in the source code.