SDLog: A Deep Learning Framework for Detecting Sensitive Information in Software Logs

Roozbeh Aghili, Xingfang Wu, Foutse Khomh, Heng Li

TL;DR

SDLog tackles the privacy risks in software logs by replacing regex-based anonymization with a deep learning approach. It formulates sensitive attribute detection as a token-level NER task using CodeBERT, introducing a two-model hierarchy that first detects broad sensitive spans and then disambiguates network-related sub-attributes. Across 16 LogHub datasets totaling 32,000 labeled log lines, SDLog outperforms the best regex patterns, achieving an overall F1 of 92.9% and near-perfect performance on network sub-attributes; fine-tuning on as few as 100 target logs further boosts precision and recall to around 99% or higher for most categories. The work demonstrates practical deployability via public models and scripts, underscoring the method’s robustness, generalization benefits, and ease of integration into real-world log anonymization pipelines.
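
To make the token-level NER framing concrete, the sketch below shows how per-token tags can be merged back into labeled spans. It is an illustration only, not SDLog's code: it assumes a BIO tagging scheme (the summary above does not say which scheme is used), and the label names `NETWORK` and `USER`, the function `bio_to_spans`, and the sample tokens are all hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, text) spans.

    Assumes tags look like "B-LABEL", "I-LABEL", or "O" (hypothetical scheme).
    """
    spans: List[Tuple[str, str]] = []
    label, parts = None, []

    def flush() -> None:
        # Close the span that is currently open, if any.
        nonlocal label, parts
        if label is not None:
            spans.append((label, " ".join(parts)))
        label, parts = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                      # a new span starts here
            label, parts = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            parts.append(token)          # continue the open span
        else:
            flush()                      # "O" or an inconsistent "I-" tag


    flush()
    return spans


# Hypothetical tokens and tags, as a token classifier might emit them.
tokens = ["Connection", "from", "10.0.0.7", "by", "john", "smith"]
tags = ["O", "O", "B-NETWORK", "O", "B-USER", "I-USER"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In a two-model setup like the one described, spans carrying the broad network label would then be handed to a second classifier for finer-grained labeling; that step is not shown here.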

Abstract

Software logs are messages recorded during the execution of a software system that provide crucial run-time information about events and activities. Although software logs have a critical role in software maintenance and operation tasks, publicly accessible log datasets remain limited, hindering advances in log analysis research and practice. The presence of sensitive information, particularly Personally Identifiable Information (PII) and quasi-identifiers, introduces serious privacy and re-identification risks, discouraging the publishing and sharing of real-world logs. In practice, log anonymization techniques primarily rely on regular expression patterns, which involve manually crafting rules to identify and replace sensitive information. However, these regex-based approaches suffer from significant limitations, such as extensive manual effort and poor generalizability across diverse log formats and datasets. To mitigate these limitations, we introduce SDLog, a deep learning-based framework designed to identify sensitive information in software logs. Our results show that SDLog overcomes regex limitations and outperforms the best-performing regex patterns in identifying sensitive information. With only 100 fine-tuning samples from the target dataset, SDLog can correctly identify 99.5% of sensitive attributes and achieves an F1-score of 98.4%. To the best of our knowledge, this is the first deep learning alternative to regex-based methods in software log anonymization.
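
The regex-based baseline described in the abstract can be pictured with a minimal sketch. The two patterns, the placeholder format, and the sample log lines below are hypothetical and are not taken from the paper; they only illustrate how hand-written rules replace what they match and silently miss anything they were not written for.

```python
import re

# Hypothetical hand-crafted rules; a real anonymizer would maintain many more.
PATTERNS = {
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def anonymize(line: str) -> str:
    """Replace every match of each rule with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"<{label}>", line)
    return line


# A line in the format the rules were written for.
covered = anonymize("Accepted connection from 10.0.0.7 for alice@example.com")
# A line from a different format: host and user names match no rule.
missed = anonymize("Accepted connection from host node-17 for user alice")

print(covered)
print(missed)
```

The second line passes through unchanged even though it still names a host and a user, which is the generalization gap across log formats that the abstract attributes to regex-based approaches.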

Paper Structure

This paper contains 40 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: A logging statement example in Python.
  • Figure 2: An example of our data annotation process.
  • Figure 3: An overall diagram of the SDLog framework.