PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance

Haoran Li, Wenbin Hu, Huihao Jing, Yulin Chen, Qi Hu, Sirui Han, Tianshu Chu, Peizhao Hu, Yangqiu Song

TL;DR

PrivaCI-Bench is a legally grounded benchmark for evaluating privacy through Contextual Integrity (CI), built from real court cases, privacy policies, and synthetic data derived from the EU AI Act. It combines structured CI parameter extraction, auxiliary knowledge bases, and multiple-choice question (MCQ) probes to assess how well LLMs understand private information flows and comply with regulations. Across multiple open- and closed-source models, the results show that CI cues improve privacy compliance, whereas standard prompting and retrieval methods are not universally beneficial, which points to a need for domain-tailored reasoning modules and better grounding. The work extends privacy evaluation beyond PII and highlights practical gaps in current LLMs’ regulatory reasoning, with implications for safer deployment in privacy-sensitive applications.

Abstract

Recent advancements in generative large language models (LLMs) have enabled wider applicability, accessibility, and flexibility. However, their reliability and trustworthiness remain in doubt, especially with respect to individuals' data privacy. Considerable effort has gone into building evaluation benchmarks that study LLMs' privacy awareness and robustness, from their generated outputs to their hidden representations. Unfortunately, most of these works adopt a narrow formulation of privacy and investigate only personally identifiable information (PII). In this paper, we build on Contextual Integrity (CI) theory, which posits that privacy evaluation should cover not only the transmitted attributes but also the whole relevant social context through private information flows. We present PrivaCI-Bench, a comprehensive contextual privacy evaluation benchmark targeted at legal compliance. It covers well-annotated privacy and safety regulations, real court cases, privacy policies, and synthetic data built from the official toolkit, and is used to study LLMs' privacy and safety compliance. We evaluate the latest LLMs, including the recent reasoning models QwQ-32B and DeepSeek-R1. Our experimental results suggest that although LLMs can effectively capture key CI parameters inside a given context, they still require further advancements for privacy compliance.

Paper Structure

This paper contains 42 sections, 2 figures, 16 tables.

Figures (2)

  • Figure 1: The workflow of our proposed PrivaCI-Bench. We decompose the transmission principle into multiple factors such as "Purpose" and "Consent". Given collected legal documents and court cases, we parse their CI parameters via ➀, ➁ and ➃. Auxiliary knowledge bases are then built in ➂ as hierarchical knowledge graphs of roles and attributes. With the help of these knowledge bases, we ground each case's contextual parameters to match the applicable regulations in ➄. Lastly, we implement various in-context reasoning modules in ➅ to determine whether the case meets existing privacy standards.
  • Figure 2: Ablation studies for the legal compliance task. All results are evaluated in %.
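
The workflow in the Figure 1 caption can be illustrated with a small sketch. This is not the benchmark's implementation: the class names, the two toy hierarchies, and the single rule are all hypothetical, and the final compliance check is a simple consent test standing in for the in-context reasoning modules the caption mentions. It only shows how a flow's CI parameters might be grounded against role and attribute hierarchies to find applicable rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationFlow:
    """CI parameters of one flow; field names are illustrative assumptions."""
    sender: str
    recipient: str
    subject: str
    attribute: str
    purpose: str   # one factor of the transmission principle
    consent: bool  # another factor of the transmission principle


@dataclass(frozen=True)
class Rule:
    """A toy regulation clause, not taken from any real statute."""
    name: str
    sender: str
    attribute: str
    requires_consent: bool


# Hypothetical child -> parent hierarchies for roles and attributes.
ROLE_PARENTS = {"cardiologist": "physician", "physician": "healthcare provider"}
ATTRIBUTE_PARENTS = {"blood test result": "health data", "health data": "personal data"}


def ancestors(term, parents):
    """Return the term followed by its ancestors, walking up the hierarchy."""
    chain = [term]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def applicable_rules(flow, rules):
    """Ground the flow: a rule applies if it names the sender role and the
    attribute, or any of their ancestors in the toy hierarchies."""
    sender_terms = set(ancestors(flow.sender, ROLE_PARENTS))
    attribute_terms = set(ancestors(flow.attribute, ATTRIBUTE_PARENTS))
    return [r for r in rules
            if r.sender in sender_terms and r.attribute in attribute_terms]


def is_compliant(flow, rules):
    """Stand-in for the reasoning step: every applicable rule that requires
    consent must be matched by consent on the flow."""
    return all(flow.consent or not r.requires_consent
               for r in applicable_rules(flow, rules))


rules = [Rule("toy-health-rule", "healthcare provider", "health data", True)]
flow = InformationFlow("cardiologist", "insurer", "patient",
                       "blood test result", "billing", consent=False)
print([r.name for r in applicable_rules(flow, rules)], is_compliant(flow, rules))
```

The rule is written at the level of "healthcare provider" and "health data", while the case mentions a "cardiologist" and a "blood test result"; the hierarchy walk is what lets the two be matched.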