A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

Prabigya Acharya, Liza Shrestha

TL;DR

The study benchmarks encoder-decoder (T5-small) and decoder-only (Mistral-7B) architectures for PII masking on English AI4Privacy data, introducing dataset normalization and three dataset variants to test label consistency and PII representations. Fine-tuning with prompt-based sequences and LoRA PEFT yields performance competitive with frontier LLMs, with Mistral achieving higher span-detection and label-exact accuracy at the cost of higher latency. Real-world evaluation via a Discord bot reveals a trade-off: T5-small offers faster inference suitable for real-time use but struggles with informal, noisy input, while Mistral-7B is more robust yet incurs prohibitive latency for synchronous deployment. Overall, the results show that lightweight models can provide effective PII masking with practical data-handling advantages, enabling privacy-preserving deployment under varying latency requirements.

Abstract

Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
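
To make the distinction between entity-level and character-level scoring concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes PII annotations are available as (start, end, label) character spans, that an entity counts as correct only when both boundaries and label match exactly, and that character-level scoring ignores labels. The helper names `prf`, `entity_level`, and `char_level` and the example sentence are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: (start, end_exclusive, label) character offsets.
Span = Tuple[int, int, str]


def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, guarding against empty sets."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def entity_level(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """A predicted entity is correct only if boundaries and label match exactly."""
    gold_set, pred_set = set(gold), set(pred)
    return prf(len(gold_set & pred_set), len(pred_set), len(gold_set))


def char_level(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Each masked character is scored on its own; labels are ignored here."""
    gold_chars = {i for start, end, _ in gold for i in range(start, end)}
    pred_chars = {i for start, end, _ in pred for i in range(start, end)}
    return prf(len(gold_chars & pred_chars), len(pred_chars), len(gold_chars))


# Hypothetical example: the model masks only "Jane" instead of "Jane Doe".
text = "Call Jane Doe at 555-0100"
gold = [(5, 13, "NAME"), (17, 25, "PHONE")]
pred = [(5, 9, "NAME"), (17, 25, "PHONE")]

entity_scores = entity_level(gold, pred)
char_scores = char_level(gold, pred)
print("entity-level P/R/F1:", entity_scores)
print("char-level   P/R/F1:", char_scores)
```

In this toy case the partially detected name counts as a full miss at the entity level but still earns credit at the character level, which is why the two views can diverge.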

Paper Structure

This paper contains 25 sections, 1 equation, 1 figure, 10 tables.

Figures (1)

  • Figure 1: Discord bot deployment examples demonstrating the model’s masking behavior. For each interaction, the original user message (a) and the corresponding masked response (b) are shown.