Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition

Junhong Ye, Xu Yuan, Xinying Qiu

TL;DR

PII recognition for automated text anonymization faces domain variability and privacy constraints. The authors conduct a large-scale, cross-domain study across healthcare (I2B2), legal (TAB), and biographical (Wikipedia) texts, comparing domain-tuned transformers, pretrained NER models, and GPT-3.5-turbo across 231 configurations to assess cross-domain transfer, data fusion, and sample efficiency. They find Longformer-based domain-tuned models deliver the strongest performance, cross-domain transfer is highly asymmetric (medical data being the hardest target), and fusion benefits are domain-specific, while substantial performance can be achieved with reduced labeled data in non-specialized domains. These results provide practical guidance for building robust PII recognizers and point to future work on expanding evaluations to conversational data, refining fusion strategies, and leveraging synthetic data for few-shot learning.

Abstract

Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), we evaluate models across four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Results show legal-domain data transfers well to biographical texts, while medical domains resist incoming transfer. Fusion benefits are domain-specific, and high-quality recognition is achievable with only 10% of training data in low-specialization domains.
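
The evaluation design described above can be pictured as a grid of source and target domains, plus a reduced-data setting. The sketch below is illustrative only and is not the authors' code: the function names `transfer_pairs` and `subsample`, the seed, and the placeholder document IDs are assumptions, and it does not attempt to reproduce the paper's full set of configurations.

```python
import random
from itertools import product

# Corpora named in the abstract; the labels are used here only as identifiers.
DOMAINS = ["I2B2", "TAB", "Wikipedia"]


def transfer_pairs(domains):
    """Return every (source, target) pair; source == target is the in-domain case."""
    return list(product(domains, repeat=2))


def subsample(documents, fraction, seed=0):
    """Draw a reproducible random subset of training documents (hypothetical helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(documents) * fraction))
    return random.Random(seed).sample(documents, k)


pairs = transfer_pairs(DOMAINS)
cross_domain = [(s, t) for s, t in pairs if s != t]

# Placeholder training set standing in for one domain's annotated documents.
training_docs = [f"doc_{i}" for i in range(200)]
few_shot_train = subsample(training_docs, fraction=0.10)
```

Under these assumptions, each pair in `cross_domain` would correspond to training on the source corpus and evaluating on the target corpus, while `few_shot_train` stands in for the 10% training condition mentioned in the abstract.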

Paper Structure

This paper contains 13 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Research Framework