Using AI/ML to Find and Remediate Enterprise Secrets in Code & Document Sharing Platforms

Gregor Kerr, David Algorry, Senad Ibraimoski, Peter Maciver, Sean Moran

TL;DR

The paper tackles the problem of accidental secret leakage in codebases and document-sharing platforms, where traditional regex-based tools produce excessive false positives. It proposes AI/ML-based detection pipelines for secrets in code and Confluence DSPs, coupled with an automated remediation pathway using OpenRewrite and a human-in-the-loop labeling workflow. Two baselines are developed: a code model using entropy and file-extension features with logistic regression, and a Confluence model using TF-IDF features and XGBoost trained on weak labels with SME relabeling. Results show improved recall and substantially reduced false positives in code, along with high recall (0.97) but moderate precision in Confluence, indicating practical potential and a path toward scalable remediation with future integration of larger language models.
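The code baseline's two feature families (entropy and file extension) can be illustrated with a minimal sketch. The function names `shannon_entropy` and `candidate_features`, the character-level entropy definition, and the example file path are assumptions made for illustration, not the authors' implementation. The summary does not say how the features are computed or encoded before the logistic regression step.

```python
import math
from collections import Counter
from pathlib import PurePosixPath


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum p * log2(1/p) over the distinct characters in the string.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def candidate_features(candidate: str, file_path: str) -> dict:
    """Hypothetical feature record for one candidate string found in a file."""
    return {
        "entropy": shannon_entropy(candidate),
        # A lower-cased suffix such as ".yaml"; a classifier would typically one-hot encode it.
        "extension": PurePosixPath(file_path).suffix.lower(),
    }


if __name__ == "__main__":
    # Assumed example inputs; random-looking tokens tend to score higher than words.
    print(candidate_features("aB3$kL9zQ", "config/App.YAML"))
    print(candidate_features("password", "src/Main.java"))
```

Records like these would then be vectorised and passed to a classifier. That step is omitted here because the summary gives no detail on the training setup.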

Abstract

We introduce a new challenge to the software development community: 1) leveraging AI to accurately detect and flag secrets in code and on popular document sharing platforms that are frequently used by developers, such as Confluence, and 2) automatically remediating the detections (e.g. by suggesting password vault functionality). This is a challenging and mostly unaddressed task. Existing methods rely on heuristics and regular expressions, which can be very noisy and therefore increase toil on developers. The next step, modifying the code itself to automatically remediate a detection, is a complex task. We introduce two baseline AI models that have good detection performance and propose an automatic mechanism for remediating secrets found in code, opening up the study of this task to the wider community.

Paper Structure (6 sections, 2 tables)