CAPID: Context-Aware PII Detection for Question-Answering Systems

Mariia Ponomarenko; Sepideh Abedini; Masoumeh Shafieinejad; D. B. Emerson; Shubhankar Mohapatra; Xi He

TL;DR

CAPID is a context-aware PII detection framework for question answering (QA) that retains only the PII spans relevant to the user's question. A synthetic data generation pipeline is used to train small language models (SLMs) to detect PII spans, classify their types, and estimate their contextual relevance, so that queries can be filtered locally before they are sent to external LLMs. Empirical results show substantial improvements in span, type, and relevance accuracy over baselines, and better downstream utility when high-relevance PII is preserved. The work also provides open-source data and models for privacy-conscious QA systems.

Abstract

Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering (QA) systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII is relevant, but their closed-source nature and lack of privacy guarantees make them unsuitable for processing sensitive data. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy while preserving significantly higher downstream utility under anonymization.
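
To make the filtering step concrete, the following is a minimal sketch of relevance-aware redaction, assuming the detector has already produced character-level spans annotated with a type and a relevance level. The `PiiSpan` structure, the `redact_irrelevant` helper, the `"high"`/`"low"` relevance labels, and the `[TYPE]` placeholder format are all illustrative assumptions, not the paper's actual interface or label set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiSpan:
    """Hypothetical detector output for one PII mention."""
    start: int      # inclusive character offset into the query
    end: int        # exclusive character offset into the query
    pii_type: str   # illustrative label, e.g. "NAME" or "HEALTH"
    relevance: str  # illustrative level, e.g. "high" or "low"


def redact_irrelevant(text: str, spans: list[PiiSpan],
                      keep: frozenset = frozenset({"high"})) -> str:
    """Replace every span whose relevance is not in `keep` with a
    type placeholder; spans judged relevant are left untouched."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported")
        pieces.append(text[cursor:span.start])         # text before the span
        if span.relevance in keep:
            pieces.append(text[span.start:span.end])   # preserve relevant PII
        else:
            pieces.append(f"[{span.pii_type}]")        # mask irrelevant PII
        cursor = span.end
    pieces.append(text[cursor:])                       # trailing text
    return "".join(pieces)


query = "My name is Alice Smith and I have asthma. Can I take ibuprofen?"
name_at = query.index("Alice Smith")
cond_at = query.index("asthma")
spans = [
    PiiSpan(name_at, name_at + len("Alice Smith"), "NAME", "low"),
    PiiSpan(cond_at, cond_at + len("asthma"), "HEALTH", "high"),
]
print(redact_irrelevant(query, spans))
# My name is [NAME] and I have asthma. Can I take ibuprofen?
```

In this example the name is masked because it does not affect the answer, while the health condition is kept because the question cannot be answered well without it. Redacting both would correspond to the blanket-redaction behavior that the abstract describes as degrading response quality.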

Paper Structure (35 sections, 4 equations, 1 figure, 6 tables)

Introduction
Related Work
Problem Statement
CAPID
Topics Generation
PII, Context and Question Generation
Context Enhancement
Data Validation
Evaluation
Model Training Performance
Downstream Performance
Evaluation on Reddit Data
Utility Analysis
Conclusion
Generation Configuration
...and 20 more sections

Figures (1)

Figure 1: The three-stage sequential pipeline for generating the dataset. Stage 1: Topics Generation, which conditions the LLM for subsequent sampling. Stage 2: PII, Context and Question Generation, involving sample-wise decomposition to create a context containing both relevant and irrelevant PII, followed by situational question formulation. Stage 3: Optimization for Relevance and Coherence, where various techniques are applied to augment the contextual data.

Theorems & Definitions (1)

Example 1
