An Evaluation Study of Hybrid Methods for Multilingual PII Detection
Harshit Rajgarhia, Suryam Gupta, Asif Shaik, Gulipalli Praveen Kumar, Y Santhoshraj, Sanka Nithya Tanvy Nishitha, Abhishek Mukherji
TL;DR
RECAP addresses PII detection in low-resource languages by combining deterministic regular expressions with context-aware prompting of large language models (LLMs). It detects 300+ PII types across 13 locales without model fine-tuning. The architecture uses per-locale detectors and a three-phase refinement pipeline that disambiguates spans carrying multiple labels, consolidates overlapping spans, and filters contextual false positives. Benchmarked with nervaluate against strong baselines, RECAP outperforms fine-tuned NER by 82% and zero-shot LLMs by 17% in weighted F1-score, with notable recall gains. The work offers a scalable, adaptable solution for privacy-compliance tasks. The authors acknowledge limitations, including reliance on a single LLM and synthetic benchmarks, and outline future directions in prompt optimization, robustness, and on-device deployment.
Abstract
The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
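To make the hybrid idea concrete, the following is a minimal sketch of a regex detection stage followed by span consolidation and a contextual filter. It is not RECAP's implementation. The patterns, the `Span` type, the longest-span-wins rule, and the `keep` callback (a stand-in for the LLM's contextual judgment) are all assumptions made for illustration. Multi-label disambiguation is not shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str
    text: str


# Hypothetical patterns for illustration only; real per-locale rules
# would be far more specific.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "DIGITS": re.compile(r"\d{4,}"),
}


def regex_detect(text):
    """Run every pattern and collect all candidate spans, overlaps included."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append(Span(m.start(), m.end(), label, m.group()))
    return spans


def consolidate(spans):
    """Resolve overlaps by keeping the longest candidate (an assumed rule)."""
    kept = []
    # Visit longer spans first; ties are broken by earlier start offset.
    for s in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def context_filter(text, spans, keep):
    """Drop spans rejected by `keep`, a placeholder for an LLM-based check."""
    return [s for s in spans if keep(text, s)]
```

For an input such as `"Contact: a.b@example.com or +91 98765 43210."`, the digit runs inside the phone number are candidates of their own until `consolidate` discards them in favor of the longer phone span. In a real system, the `keep` callback would be where surrounding context decides whether a matched string is actually personal data.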
