PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance
Haoran Li, Wenbin Hu, Huihao Jing, Yulin Chen, Qi Hu, Sirui Han, Tianshu Chu, Peizhao Hu, Yangqiu Song
TL;DR
PrivaCI-Bench is a comprehensive, legally grounded benchmark for evaluating privacy under Contextual Integrity (CI), built from real court cases, privacy policies, and EU AI Act–driven synthetic data. It combines structured CI parameter extraction, expansive auxiliary knowledge bases, and a large set of multiple-choice question (MCQ) probes to assess LLMs' ability to understand private information flows and comply with regulations. Across multiple open- and closed-source models, results show that CI cues improve privacy compliance, whereas standard prompting and retrieval methods are not universally beneficial, underscoring the need for domain-tailored reasoning modules and improved grounding. The work advances privacy evaluation beyond PII and highlights practical gaps in current LLMs' regulatory reasoning, with implications for safer deployment in privacy-sensitive applications.
Abstract
Recent advances in generative large language models (LLMs) have broadened their applicability, accessibility, and flexibility. However, their reliability and trustworthiness remain in doubt, especially with regard to individuals' data privacy. Considerable effort has gone into building evaluation benchmarks that study LLMs' privacy awareness and robustness, from their generated outputs to their hidden representations. Unfortunately, most of these works adopt a narrow formulation of privacy and investigate only personally identifiable information (PII). In this paper, we build on Contextual Integrity (CI) theory, which posits that privacy evaluation should cover not only the transmitted attributes but also the whole relevant social context through private information flows. We present PrivaCI-Bench, a comprehensive contextual privacy evaluation benchmark targeted at legal compliance. It covers well-annotated privacy and safety regulations, real court cases, privacy policies, and synthetic data built from the official toolkit, and is used to study LLMs' privacy and safety compliance. We evaluate the latest LLMs, including the recent reasoner models QwQ-32B and DeepSeek R1. Our experimental results suggest that although LLMs can effectively capture key CI parameters inside a given context, they still require further advancements for privacy compliance.
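To make the contrast between a PII-only view and a CI view concrete, the following is a minimal illustrative sketch, not the benchmark's actual schema. It assumes the five parameters commonly used to describe an information flow in CI theory (sender, recipient, subject, attribute, transmission principle); the class name, helper functions, and example flows are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InformationFlow:
    """Hypothetical record of one information flow (assumed CI parameters)."""
    sender: str
    recipient: str
    subject: str
    attribute: str
    transmission_principle: str


def pii_only_view(flow: InformationFlow) -> str:
    # A PII-centric evaluation looks only at the transmitted attribute.
    return flow.attribute


def differing_parameters(a: InformationFlow, b: InformationFlow) -> list[str]:
    # A CI-style comparison considers every parameter of the flow.
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Two made-up flows that transmit the same attribute in different contexts.
clinical = InformationFlow(
    sender="patient",
    recipient="physician",
    subject="patient",
    attribute="diagnosis",
    transmission_principle="confidential treatment",
)
marketing = InformationFlow(
    sender="physician",
    recipient="advertiser",
    subject="patient",
    attribute="diagnosis",
    transmission_principle="sale without consent",
)

if __name__ == "__main__":
    print(pii_only_view(clinical) == pii_only_view(marketing))
    print(differing_parameters(clinical, marketing))
```

Under these assumptions, the attribute alone cannot distinguish the two flows, while the surrounding context (who sends, who receives, and under which principle) can, which is the distinction the abstract draws.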
