PII-Bench: Evaluating Query-Aware Privacy Protection Systems

Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han

TL;DR

PII-Bench introduces a comprehensive, query-aware evaluation framework for privacy protection in LLM prompts, combining a novel query-unrelated PII masking strategy with a large, diverse dataset of 2,842 samples across 55 PII types. The framework demonstrates that while current systems achieve high basic PII detection (F1 typically above $0.90$), they struggle to determine which PII is necessary for a given query, especially in multi-subject scenarios, indicating significant room for intelligent masking. A broad experimental study shows sizeable performance gaps between large and small models and highlights the benefits and limits of advanced prompting strategies in achieving privacy-preserving yet useful LLM interactions. Overall, PII-Bench provides a concrete benchmark, methodology, and insights to guide future efforts toward robust, query-aware privacy protection for interactive language models.

Abstract

The widespread adoption of Large Language Models (LLMs) has raised significant privacy concerns regarding the exposure of personally identifiable information (PII) in user prompts. To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions. Each sample is carefully crafted with a user query, context description, and standard answer indicating query-relevant PII. Our empirical evaluation reveals that while current models perform adequately in basic PII detection, they show significant limitations in determining PII query relevance. Even state-of-the-art LLMs struggle with this task, particularly in handling complex multi-subject scenarios, indicating substantial room for improvement in achieving intelligent PII masking.
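As a rough illustration of the query-unrelated masking idea described above, the following minimal sketch masks every annotated PII entity that is not needed to answer the user query. The names `PIIEntity`, `Sample`, and `mask_query_unrelated`, the placeholder format, and the type labels are all hypothetical and are not taken from the PII-Bench release. The sketch also assumes the query-relevant PII is already known (as in a benchmark's standard answer), whereas a real system would have to predict it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PIIEntity:
    text: str      # surface string as it appears in the context
    pii_type: str  # illustrative label, e.g. "PHONE" (not the benchmark's taxonomy)


@dataclass
class Sample:
    query: str                 # what the user asks the LLM
    context: str               # user description containing PII
    entities: list[PIIEntity]  # all PII annotated in the context
    query_related: set[str]    # texts of the PII needed to answer the query


def mask_query_unrelated(sample: Sample) -> str:
    """Replace each PII string the query does not need with a type placeholder."""
    masked = sample.context
    # Replace longer strings first so a short entity cannot clobber a longer one.
    for ent in sorted(sample.entities, key=lambda e: len(e.text), reverse=True):
        if ent.text not in sample.query_related:
            masked = masked.replace(ent.text, f"[{ent.pii_type}]")
    return masked


# Hypothetical example data, invented for illustration only.
sample = Sample(
    query="What is the best way to reach Alice by phone?",
    context="Alice Chen lives at 12 Elm Street and her phone number is 555-0100.",
    entities=[
        PIIEntity("Alice Chen", "NAME"),
        PIIEntity("12 Elm Street", "ADDRESS"),
        PIIEntity("555-0100", "PHONE"),
    ],
    query_related={"Alice Chen", "555-0100"},
)

masked_prompt = mask_query_unrelated(sample)
print(masked_prompt)
```

Here the address is masked because the query does not depend on it, while the name and phone number are kept. Passing an empty `query_related` set reproduces the "mask all PII" baseline, and skipping the function entirely corresponds to no masking.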

Paper Structure

This paper contains 39 sections, 16 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: The overall performance of three PII masking strategies: No Masking, All PII Masking, and Query-unrelated PII Masking. Effective privacy protection systems must maintain the LLM's functionality while protecting the user's privacy as much as possible.
  • Figure 2: PII-Bench synthesis process consists of three main modules: (a) PII Entity Generation, (b) User Description Generation, and (c) Query Generation.
  • Figure 3: An example from PII-Bench, which evaluates a privacy protection system's ability to mask as much PII as possible while maintaining the LLM's functionality. The evaluation is separated into two fundamental tasks: (a) The PII Detection Task: Identify and classify PII entities for each subject in the prompt, with ground truth labels shown on the right side. (b) The Query-Related PII Detection Task: Determine which PII entities are necessary for answering the user query, enabling selective masking of irrelevant personal information.
  • Figure 4: The performance of GPT-4o is correlated with the number of subjects, the number of PII entities, the description length, and the number of query-related PII entities.
  • Figure 5: Performance comparison across different models for seven main PII types.
  • ...and 13 more figures