Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis

Lixi Zhou, Lei Yu, Jia Zou, Hong Min

TL;DR

This paper argues for a source code analysis approach to log redaction and demonstrates that this approach can significantly improve the precision of sensitive-information detection and reduce both false positives and false negatives.

Abstract

Protecting sensitive information in diagnostic data such as logs is a critical concern in the industrial software diagnosis and debugging process. While many tools have been developed to automatically redact logs by identifying and removing sensitive information, they have severe limitations that can cause over-redaction and loss of critical diagnostic information (false positives), disclosure of sensitive information (false negatives), or both. To address this problem, we argue for a source code analysis approach to log redaction. To identify a log message containing sensitive information, our method locates the corresponding log statement in the source code with logger code augmentation, and checks whether the log statement outputs data from sensitive sources by using the data flow graph built from the source code. Appropriate redaction rules are then applied depending on the sensitivity of the data sources to protect private information in the logs. We conducted an experimental evaluation and a comparison with popular baselines. The results demonstrate that our approach can significantly improve the precision of sensitive-information detection and reduce both false positives and false negatives.

Paper Structure

This paper contains 10 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Overview of proposed methodology
  • Figure 2: (a) Metadata extracted for each function by traversing the AST of its source code. (b) The data flow graph built from the corresponding function definition.
  • Figure 3: (a) The metadata of each function needs to be extracted after parsing the source code file. (b) After converting the source code into an abstract syntax tree (AST), the corresponding metadata is extracted by traversing the tree.
  • Figure 4: Matching the log statement to its corresponding node in the DFG, which enables backtracking to its data source node.
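
The idea behind Figures 2 to 4 (traverse a function's AST, build a data flow graph, then backtrack from a log statement to its data sources) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses Python's standard `ast` module, handles only straight-line assignments inside a single function, and keys each log statement by its line number as a stand-in for the paper's logger code augmentation. The names `SENSITIVE_SOURCES`, `get_password`, `handle`, and `logger` are hypothetical and chosen only for illustration.

```python
import ast

# Hypothetical names of functions treated as sensitive data sources.
SENSITIVE_SOURCES = {"get_password"}

# Hypothetical function under analysis; it is parsed, never executed.
SOURCE = '''
def handle(user_id):
    pwd = get_password(user_id)
    token = pwd
    count = len(user_id)
    logger.info("token=%s", token)
    logger.info("count=%s", count)
'''

def names_in(node):
    """Names read anywhere inside an expression."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def calls_in(node):
    """Names of plain functions called inside an expression."""
    return {n.func.id for n in ast.walk(node)
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}

def is_logger_call(stmt):
    """True for statements of the form logger.<method>(...)."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return False
    func = stmt.value.func
    return (isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "logger")

def reaches_sensitive(var, deps, origins, seen=None):
    """Backtrack along data flow edges from var to a sensitive source."""
    if seen is None:
        seen = set()
    if var in seen:
        return False
    seen.add(var)
    if origins.get(var, set()) & SENSITIVE_SOURCES:
        return True
    return any(reaches_sensitive(d, deps, origins, seen)
               for d in deps.get(var, ()))

def analyze(source):
    """Map each log statement's line number to whether it logs sensitive data."""
    func = ast.parse(source).body[0]
    deps, origins, findings = {}, {}, {}
    for stmt in func.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            var = stmt.targets[0].id
            deps[var] = names_in(stmt.value)      # data flow edges into var
            origins[var] = calls_in(stmt.value)   # functions that produced var
        elif is_logger_call(stmt):
            logged = set()
            for arg in stmt.value.args:
                logged |= names_in(arg)
            findings[stmt.lineno] = any(
                reaches_sensitive(v, deps, origins) for v in logged)
    return findings

print(analyze(SOURCE))
```

In this sketch a log message would be matched to a source line, and only the log statements flagged as reaching a sensitive source would be candidates for redaction. A realistic analysis would also need to handle branches, loops, and calls across functions, which this example deliberately omits.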