Secret Breach Prevention in Software Issue Reports

Sadif Ahmed, Md Nafiu Rahman, Zahin Wahab, Gias Uddin, Rifat Shahriyar

TL;DR

The paper tackles the overlooked problem of accidental secret leakage in software issue reports and introduces a large-scale benchmark of 54,148 labeled instances. It proposes a detection pipeline that combines regex-based candidate extraction with contextual classification by both proprietary and open-source language models, demonstrating substantial gains over traditional regex/entropy baselines. Key findings show small BERT-like models and fine-tuned open-source LLMs achieving up to 94% F1 on the task, while GPT-4o performs more modestly in prompting settings, and evaluation on real-world repositories yields an 81.6% macro F1, indicating strong generalization. The work provides a practical path to deploying automated secret detection in issue trackers and highlights remaining challenges, such as handling ambiguous placeholders and long cryptographic material, while offering replication-ready data and models for further research.

Abstract

In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular-expression-based extraction with large language model (LLM)-based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate entropy-based baselines and keyword heuristics used by prior secret detection tools, classical machine learning, deep learning, and LLM-based methods. Regex- and entropy-based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models like GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and fine-tuned larger open-source LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we validate our approach on 178 real-world GitHub repositories, achieving an F1-score of 81.6%, which demonstrates our approach's strong ability to generalize to in-the-wild scenarios.

Paper Structure

This paper contains 53 sections, 5 figures, 18 tables.

Figures (5)

  • Figure 1: Example of a secret leak in a GitHub issue report (the actual key is masked with a dummy key for security purposes).
  • Figure 2: Trend of secret leaks over time (2016–2025).
  • Figure 3: Prompt for the classification model.
  • Figure 4: Workflow for secret detection in issue reports. Candidate strings are extracted using 761 regular expressions and a 200-character context window is created around each. Human-labeled samples are used to fine-tune a language model during training. In inference, the same extraction and context generation steps are applied, and the fine-tuned model classifies each candidate as secret or non-sensitive.
  • Figure 5: Qwen-7B confusion matrix for secret detection.
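
The extraction and context-generation steps described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two patterns are stand-ins for the paper's 761 regular expressions, the names `PATTERNS`, `Candidate`, and `extract_candidates` are hypothetical, and the 200-character window is assumed here to mean up to 100 characters on each side of the match (the caption does not say how the window is split). The classification step by the fine-tuned model is not shown.

```python
import re
from typing import Iterator, NamedTuple

# Illustrative stand-ins only; the paper uses 761 regular expressions
# that are not listed in this summary.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


class Candidate(NamedTuple):
    pattern: str   # name of the pattern that matched
    value: str     # the candidate string itself
    context: str   # surrounding text passed to the classifier


def extract_candidates(text: str, window: int = 200) -> Iterator[Candidate]:
    """Yield each regex match together with its surrounding context.

    Assumption: `window` is split evenly, so the context holds up to
    window // 2 characters before and after the matched string.
    """
    half = window // 2
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - half)
            end = min(len(text), match.end() + half)
            yield Candidate(name, match.group(0), text[start:end])


if __name__ == "__main__":
    # Dummy key, in the spirit of the masked example in Figure 1.
    issue_body = "Deploy fails with key=AKIAABCDEFGHIJKLMNOP in config.yml"
    for cand in extract_candidates(issue_body):
        print(cand.pattern, cand.value)
```

Each `Candidate.context` would then be handed to the classifier, which labels the candidate as secret or non-sensitive; keeping the window size as a parameter makes it easy to try other splits.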