Towards Contextual Sensitive Data Detection

Liang Telkamp, Madelon Hulsebos

TL;DR

Open data portals enable data reuse but raise privacy risks; static PII detection is insufficient for context-dependent sensitivity. The authors propose two LLM-assisted mechanisms—type contextualization (detect-then-reflect) and domain contextualization (retrieve-then-detect)—to detect contextual sensitive data in tabular datasets. Type contextualization improves precision and reduces false positives, while domain contextualization grounds sensitivity in domain-specific rules, achieving high recall and improved precision, with explainable justifications in humanitarian settings. The work provides open-source code and annotated datasets to facilitate adoption and evaluation in real-world data-sharing scenarios.

Abstract

The emergence of open data portals necessitates more attention to protecting sensitive data before datasets are published and exchanged. While an abundance of methods for suppressing sensitive data exists, the conceptualization of sensitive data, and the methods to detect it, focus particularly on personal data that, if disclosed, may be harmful or violate privacy. We observe the need to refine and broaden our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Based on this definition, we introduce two mechanisms for contextual sensitive data detection that consider the broader context of the dataset at hand. First, we introduce type contextualization, which detects the semantic type of particular data values and then considers the overall context of those values within the dataset or document. Second, we introduce domain contextualization, which determines the sensitivity of a given dataset in its broader context (e.g., data topic and geographic origin) by retrieving relevant rules from documents that specify data sensitivity. Experiments with these mechanisms, assisted by large language models (LLMs), confirm that: 1) type contextualization significantly reduces the number of false positives for type-based sensitive data detection and reaches a recall of 94%, compared to 63% with commercial tools, and 2) domain contextualization leveraging sensitivity rule retrieval is effective for context-grounded sensitive data detection in non-standard data domains such as humanitarian datasets. Evaluation with humanitarian data experts also reveals that context-grounded LLM explanations provide useful guidance in manual data auditing processes, improving consistency. We open-source the mechanisms and annotated datasets for contextual sensitive data detection at https://github.com/trl-lab/sensitive-data-detection.
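
To make the two mechanisms concrete, here is a minimal Python sketch of the detect-then-reflect and retrieve-then-detect control flow described above. It is not the authors' implementation (that lives in the linked repository). Every name here is a hypothetical stand-in: `TYPE_PATTERNS`, `detect_type`, `type_contextualize`, `Rule`, `retrieve_rules`, `domain_contextualize`, `RULES`, and `toy_judge`. Regexes replace a real semantic type detector, keyword overlap replaces a real retriever, and `toy_judge` is a trivial placeholder for the LLM call. The two rules are invented for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# ASSUMPTION: regexes stand in for a real semantic type detector.
TYPE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{7,}$"),
}

# ASSUMPTION: a judge maps a prompt to a yes/no answer; in practice an LLM.
Judge = Callable[[str], bool]


def detect_type(values: list[str], min_share: float = 0.8) -> Optional[str]:
    """Detect: return the first type matching at least min_share of the values."""
    if not values:
        return None
    for name, pattern in TYPE_PATTERNS.items():
        share = sum(bool(pattern.match(v)) for v in values) / len(values)
        if share >= min_share:
            return name
    return None


def type_contextualize(column: str, values: list[str],
                       table_context: str, judge: Judge) -> bool:
    """Detect-then-reflect: only columns with a detected type reach the judge."""
    detected = detect_type(values)
    if detected is None:
        return False
    prompt = (f"Column '{column}' holds values of type {detected}. "
              f"Table context: {table_context}. Is it sensitive here?")
    return judge(prompt)


@dataclass(frozen=True)
class Rule:
    text: str
    keywords: frozenset


def retrieve_rules(description: str, rules: list[Rule], top_k: int = 2) -> list[Rule]:
    """Retrieve: rank rules by keyword overlap with the dataset description."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    scored = [(len(r.keywords & words), r) for r in rules]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rule for _, rule in scored[:top_k]]


def domain_contextualize(description: str, rules: list[Rule],
                         judge: Judge) -> tuple[bool, list[Rule]]:
    """Retrieve-then-detect: judge the dataset against the retrieved rules."""
    relevant = retrieve_rules(description, rules)
    if not relevant:
        # Design choice of this sketch: without a matching rule, make no claim.
        return False, []
    rule_text = " ".join(r.text for r in relevant)
    prompt = f"Dataset: {description}. Applicable rules: {rule_text}. Sensitive?"
    return judge(prompt), relevant


# ASSUMPTION: invented rules, not taken from any real sensitivity guideline.
RULES = [
    Rule("Locations of displaced people are sensitive in conflict settings.",
         frozenset({"displaced", "conflict"})),
    Rule("Individual health records are sensitive.",
         frozenset({"health", "patient"})),
]


def toy_judge(prompt: str) -> bool:
    """Placeholder for an LLM: flags everything not described as public."""
    return "public" not in prompt.lower()
```

Passing the judge as a callable keeps the control flow testable without a model. The returned rule list in `domain_contextualize` is what would let an explanation cite the rule that a decision was grounded in.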

Paper Structure

This paper contains 20 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Data protection process, starting with the specification of sensitivity, then detecting sensitive data in inputs accordingly, followed by remedying sensitive data with suppression methods.
  • Figure 2: Mechanisms for detecting sensitive data in a contextual manner.

Theorems & Definitions (1)

  • Definition: Contextual Sensitive Data