An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Harshit Rajgarhia, Suryam Gupta, Asif Shaik, Gulipalli Praveen Kumar, Y Santhoshraj, Sanka Nithya Tanvy Nishitha, Abhishek Mukherji

TL;DR

RECAP addresses PII detection in low-resource languages by combining deterministic regular expressions with context-aware prompting of large language models (LLMs). It detects 300+ PII types across 13 locales without model fine-tuning. The architecture uses per-locale detectors and a three-phase refinement pipeline that resolves spans assigned multiple labels, consolidates overlapping spans, and filters contextual false positives. Benchmarked with nervaluate against strong baselines, RECAP outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score, with notable recall gains. The work offers a scalable, adaptable solution for privacy-compliance tasks. It acknowledges limitations, including reliance on a single LLM and on synthetic benchmarks, and outlines future directions in prompt optimization, robustness, and on-device deployment.

Abstract

The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
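To make the hybrid idea concrete, the following is a minimal sketch of how regex matches and LLM-proposed spans might be merged when they overlap. It is not the authors' implementation: the `Span` type, the `EMAIL` pattern, the `regex_spans` and `consolidate` helpers, and the "longest span wins, ties favor regex" rule are all assumptions made for illustration, and the LLM step is replaced by a hard-coded candidate span.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity type, e.g. "EMAIL"
    source: str  # "regex" or "llm" (hypothetical provenance tag)


# Hypothetical, deliberately simple email pattern; a real detector would be
# locale-specific and cover many more entity types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_spans(text: str) -> list[Span]:
    """Deterministic pass: emit one span per regex match."""
    return [Span(m.start(), m.end(), "EMAIL", "regex") for m in EMAIL.finditer(text)]


def consolidate(spans: list[Span]) -> list[Span]:
    """Greedy consolidation (an assumed rule, not the paper's):
    visit spans longest first, preferring regex on ties, and keep a span
    only if it does not overlap any span already kept."""
    ordered = sorted(
        spans,
        key=lambda s: (-(s.end - s.start), s.source != "regex", s.start),
    )
    kept: list[Span] = []
    for s in ordered:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


if __name__ == "__main__":
    text = "Contact: jane.doe@example.com today"
    # Stand-in for an LLM candidate that tagged only the local part as a name.
    llm_candidates = [Span(9, 17, "NAME", "llm")]
    for span in consolidate(regex_spans(text) + llm_candidates):
        print(span.label, repr(text[span.start:span.end]))
```

In this sketch the shorter LLM span is dropped because it lies entirely inside the longer regex match. A production pipeline would also need the label-disambiguation and contextual-filtering phases that the abstract mentions, which are not modeled here.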

Paper Structure

This paper contains 11 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: RECAP Architecture
  • Figure 2: Multi-labeling (top) and False Positives (bottom) detection problem and resolution
  • Figure 3: F1 Scores by Approach and Locale
  • Figure 4: Pretrained model codes used for multilingual NER