RiskHarvester: A Risk-based Tool to Prioritize Secret Removal Efforts in Software Artifacts
Setu Kumar Basak, Tanmay Pardeshi, Bradley Reaves, Laurie Williams
TL;DR
This work addresses the problem of prioritizing secret removal in software artifacts by introducing RiskHarvester, a risk-based tool that assigns a security risk score to each secret-asset pair as the product of asset value and ease of attack. Asset value is inferred from database keywords in the source code, which are mapped to sensitive data categories, while ease of attack is derived from passive network information about the database host. The approach is evaluated with RiskBench, a benchmark of 1,791 database secret-asset pairs, and with a survey of 52 developers. RiskHarvester achieves 95% precision and 90% recall in detecting database keywords and 96% precision and 94% recall in detecting valid hosts. In the survey, 86% of developers prioritized secrets for removal in descending order of security risk score, which suggests the score is useful for ordering remediation work in real-world development workflows.
Abstract
Since 2020, GitGuardian has been detecting checked-in hard-coded secrets in GitHub repositories. From 2020 to 2023, GitGuardian observed an upward annual trend and a four-fold increase in hard-coded secrets, with 12.8 million exposed in 2023. However, removing all secrets from software artifacts is not feasible due to time constraints and technical challenges. Additionally, not all secrets carry equal security risk, since they protect assets ranging from obsolete databases to sensitive medical data. Thus, secret removal should be prioritized by security risk reduction, which existing secret detection tools do not support. The goal of this research is to aid software practitioners in prioritizing secret removal efforts through our security risk-based tool. We present RiskHarvester, a risk-based tool that computes a security risk score from the value of the asset and the ease of attack on a database. We calculated the value of the asset by identifying the sensitive data categories present in a database from the database keywords in the source code. We utilized data flow analysis, SQL parsing, and ORM parsing to identify the database keywords. To calculate the ease of attack, we utilized passive network analysis to retrieve the database host information. To evaluate RiskHarvester, we curated RiskBench, a benchmark of 1,791 database secret-asset pairs with sensitive data categories and host information manually retrieved from 188 GitHub repositories. RiskHarvester demonstrates 95% precision and 90% recall in detecting database keywords for the value of the asset, and 96% precision and 94% recall in detecting valid hosts for the ease of attack. Finally, we conducted a survey (52 respondents) to understand whether developers prioritize secret removal based on the security risk score. We found that 86% of the developers prioritized the secrets for removal in descending order of security risk score.
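As a rough illustration of the scoring idea summarized above (risk score as the product of asset value and ease of attack), the following minimal Python sketch ranks secret-asset pairs in descending order of risk. It is not RiskHarvester's implementation: the keyword-to-category mapping, the category weights, the additive asset value, and the assumption that ease of attack is already normalized to [0, 1] are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping from database keywords to sensitive data categories.
KEYWORD_CATEGORIES = {
    "ssn": "identity",
    "diagnosis": "medical",
    "card_number": "financial",
    "email": "contact",
}

# Hypothetical per-category weights; the real tool's values are not given here.
CATEGORY_WEIGHTS = {
    "medical": 1.0,
    "financial": 0.9,
    "identity": 0.8,
    "contact": 0.4,
}


@dataclass
class SecretAssetPair:
    secret_id: str
    db_keywords: List[str]   # keywords found in source code for this database
    ease_of_attack: float    # assumed to be pre-computed and normalized to [0, 1]


def asset_value(keywords: List[str]) -> float:
    """Sum the weights of the distinct sensitive categories the keywords map to."""
    categories = {KEYWORD_CATEGORIES[k] for k in keywords if k in KEYWORD_CATEGORIES}
    return sum(CATEGORY_WEIGHTS[c] for c in categories)


def risk_score(pair: SecretAssetPair) -> float:
    """Risk score as the product of asset value and ease of attack."""
    return asset_value(pair.db_keywords) * pair.ease_of_attack


def prioritize(pairs: List[SecretAssetPair]) -> List[SecretAssetPair]:
    """Order pairs so that the highest-risk secrets come first."""
    return sorted(pairs, key=risk_score, reverse=True)
```

Under these assumed weights, a secret guarding a database with medical and identity keywords on a moderately exposed host would rank above a secret guarding only contact data on a highly exposed host, because the product rewards pairs that score high on both factors.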
