Comparing Feature-based and Context-aware Approaches to PII Generalization Level Prediction

Kailin Zhang, Xinying Qiu

TL;DR

This work tackles PII generalization for privacy-preserving text by introducing a feature-based ensemble method for structured inputs and a novel context-aware framework that leverages contextual information via Multilingual-BERT, functional transformations, and mean squared error scoring. On the WikiReplace dataset, the context-aware approach consistently outperforms the feature-based method, highlighting the importance of contextual and semantic relationships in generalization decisions. Key contributions include demonstrating the effectiveness of feature engineering with ensemble learning for structured PII inputs, proposing a scalable context-aware pipeline, and providing a comparative analysis across dataset scales. The findings are of practical use for improving PII generalization in real-world text anonymization systems while balancing privacy risk and data utility.

Abstract

Protecting Personally Identifiable Information (PII) in text data is crucial for privacy, but current PII generalization methods face challenges such as uneven data distributions and limited context awareness. To address these issues, we propose two approaches: a feature-based method using machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and semantic relationships between the original text and generalized candidates. The context-aware approach employs Multilingual-BERT for text representation, functional transformations, and mean squared error scoring to evaluate candidates. Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales. This work contributes to advancing PII generalization techniques by highlighting the importance of feature selection, ensemble learning, and incorporating contextual information for better privacy protection in text anonymization.
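
The candidate-scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors stand in for Multilingual-BERT embeddings, the `transform` argument is a hypothetical placeholder for the paper's functional transformations (identity by default), and choosing the candidate with the lowest mean squared error is an assumption about how the scores are used. All function and variable names are illustrative.

```python
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_candidate(
    original: Vector,
    candidates: Dict[str, Vector],
    transform: Callable[[Vector], Vector] = lambda v: list(v),
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate against the transformed original embedding.

    Returns the label with the lowest MSE (an assumed selection rule)
    together with the full score table.
    """
    target = transform(original)  # hypothetical stand-in for a functional transformation
    scores = {label: mse(target, emb) for label, emb in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for Multilingual-BERT embeddings (assumption).
original_embedding = [0.9, 0.1, 0.4]
candidate_embeddings = {
    "city": [0.8, 0.2, 0.5],
    "country": [0.3, 0.6, 0.1],
    "location": [0.0, 0.9, 0.9],
}

best_label, all_scores = select_candidate(original_embedding, candidate_embeddings)
print(best_label, all_scores)
```

In this toy setup the candidate closest to the original embedding is selected; a real pipeline would replace the hand-written vectors with encoder outputs and the identity function with whatever transformation the model learns.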

Paper Structure (20 sections, 6 equations, 5 figures, 8 tables)

This paper contains 20 sections, 6 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Comparing Original Model and our proposed Feature-based Model
  • Figure 2: Our proposed Context-aware model
  • Figure 3: Distribution of number of generalizations and selected level
  • Figure 4: Majority-vote Accuracy of machine learning models
  • Figure 5: Confusion matrices of the Context-aware Method at different dataset scales, where $C$ is the maximum number of candidates. The numbers on the diagonal are the true positive rates.