Secret Breach Prevention in Software Issue Reports
Sadif Ahmed, Md Nafiu Rahman, Zahin Wahab, Gias Uddin, Rifat Shahriyar
TL;DR
The paper tackles the overlooked problem of accidental secret leakage in software issue reports and introduces a large-scale benchmark of 54,148 labeled instances. It proposes a detection pipeline that combines regex-based candidate extraction with contextual classification by both proprietary and open-source language models, demonstrating substantial gains over traditional regex and entropy baselines. Small BERT-like models and fine-tuned open-source LLMs achieve up to 94% F1 on the task, GPT-4o performs more modestly in few-shot prompting (80.13% F1), and an evaluation on real-world repositories yields an 81.6% macro F1, indicating strong generalization. The work provides a practical path to deploying automated secret detection in issue trackers and highlights remaining challenges, such as handling ambiguous placeholders and long cryptographic material, while offering replication-ready data and models for further research.
Abstract
In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular expression-based extraction with large language model (LLM)-based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate the entropy-based baselines and keyword heuristics used by prior secret detection tools, as well as classical machine learning, deep learning, and LLM-based methods. Regex- and entropy-based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models such as GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and larger fine-tuned open-source LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we validate our approach on 178 real-world GitHub repositories, achieving an F1-score of 81.6%, which demonstrates its strong ability to generalize to in-the-wild scenarios.
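The first stage of the pipeline described above, regex-based candidate extraction, together with the kind of entropy score used by baseline tools, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the pattern set, the function names (`extract_candidates`, `shannon_entropy`), and the 40-character context window are all assumptions made for the example. The candidates it returns would then be passed, with their surrounding context, to a classifier that decides whether each one is a real secret.

```python
import math
import re
from collections import Counter

# Hypothetical patterns for illustration only; the paper's actual
# pattern set is not given in this section.
CANDIDATE_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "generic_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|token|secret|password)\b\s*[:=]\s*['\"]?([^\s'\"]{8,})"
    ),
}


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_candidates(issue_text, window=40):
    """Return every regex match together with its surrounding context.

    High recall is the goal at this stage; false positives (placeholders,
    example keys) are expected and left to the downstream classifier.
    """
    candidates = []
    for name, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(issue_text):
            # Use the capture group when the pattern defines one,
            # otherwise the whole match.
            group = 1 if match.lastindex else 0
            value = match.group(group)
            start, end = match.span(group)
            candidates.append({
                "pattern": name,
                "value": value,
                "context": issue_text[max(0, start - window):end + window],
                "entropy": shannon_entropy(value),
            })
    return candidates


if __name__ == "__main__":
    issue = (
        "Getting a 403 on deploy. My config has api_key = 'YOUR_API_KEY_HERE' "
        "and the docs example uses AKIAIOSFODNN7EXAMPLE."
    )
    for candidate in extract_candidates(issue):
        print(candidate["pattern"], candidate["value"], round(candidate["entropy"], 2))
```

Both strings in the sample issue are matched even though one is a placeholder and the other is a documentation example key. This is the high-recall, low-precision behavior the abstract attributes to regex and entropy approaches, and it is why a context-aware classification step follows.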
