Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory

Haoran Li, Wei Fan, Yulin Chen, Jiayang Cheng, Tianshu Chu, Xuebing Zhou, Peizhao Hu, Yangqiu Song

TL;DR

This work reframes privacy evaluation as context-aware reasoning grounded in Contextual Integrity, using a Privacy Checklist to model normative information flows within HIPAA. It constructs a scalable knowledge base comprising a HIPAA-centric document tree, CI-characteristics, role/attribute graphs, and a definition dictionary to enable in-context reasoning with LLMs and retrieval-augmented approaches. Through retrieval strategies (BM25, embedding similarity, and agent-based methods) and curated prompting (including CoT variants), the approach improves LLM privacy judgments on real court data, achieving notable accuracy gains. The study highlights both the potential and current limitations of LLMs for legal-privacy judgments, outlining future directions to broaden regulation coverage and enhance retrieval fidelity.
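As a rough illustration of the BM25 retrieval step mentioned above, the following minimal sketch ranks a few candidate norms against a query built from a Contextual Integrity style information flow. It is not the authors' implementation: the `InformationFlow` fields, the `TOY_NORMS` texts (loose paraphrases written for this example, not regulatory text), and the parameter values `k1=1.5` and `b=0.75` are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationFlow:
    """A CI-style information flow. Field names are illustrative assumptions."""
    sender: str
    recipient: str
    subject: str
    attribute: str
    transmission_principle: str

    def as_query(self) -> str:
        # Flatten the flow into a plain-text retrieval query.
        return " ".join([self.sender, self.recipient, self.subject,
                         self.attribute, self.transmission_principle])


def tokenize(text: str) -> list[str]:
    # Lowercase and split on any non-alphanumeric character.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
    """Return (doc_id, score) pairs sorted by Okapi BM25 score, highest first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # IDF variant with +1 inside the log, so it is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical norm snippets: loose paraphrases for illustration only.
TOY_NORMS = {
    "treatment": "A covered entity may disclose protected health information "
                 "to a health care provider for treatment of the patient.",
    "marketing": "A covered entity must obtain the individual's written "
                 "authorization before using protected health information for marketing.",
    "law_enforcement": "A covered entity may disclose protected health information "
                       "to a law enforcement official in response to a court order.",
}

flow = InformationFlow(
    sender="hospital",
    recipient="marketing company",
    subject="patient",
    attribute="diagnosis",
    transmission_principle="without authorization",
)

ranked = bm25_rank(flow.as_query(), TOY_NORMS)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

In a retrieval-augmented setup of the kind the summary describes, the top-ranked norms would then be placed in the LLM prompt alongside the case description; that prompting step is omitted here.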

Abstract

Privacy research has attracted wide attention as individuals worry that their private data can be easily leaked during interactions with smart devices, social platforms, and AI applications. Computer science researchers, on the other hand, commonly study privacy through attacks and defenses within segmented fields. Privacy research spans various sub-fields, including Computer Vision (CV), Natural Language Processing (NLP), and Computer Networks, and each field formulates privacy in its own way. Though pioneering works on attacks and defenses reveal sensitive privacy issues, they are narrowly scoped and cannot fully cover people's actual privacy concerns. Consequently, general and human-centric privacy research remains largely unexplored. In this paper, we formulate the privacy issue as a reasoning problem rather than simple pattern matching. We ground our work in Contextual Integrity (CI) theory, which posits that people's perceptions of privacy are highly correlated with the corresponding social context. Based on this assumption, we develop the first comprehensive checklist that covers social identities, private attributes, and existing privacy regulations. Unlike prior works on CI that either cover limited expert-annotated norms or model incomplete social context, our proposed privacy checklist uses the whole Health Insurance Portability and Accountability Act of 1996 (HIPAA) as an example to show that large language models (LLMs) can be used to completely cover HIPAA's regulations. Additionally, our checklist gathers expert annotations across multiple ontologies to determine private information, including but not limited to personally identifiable information (PII). We use our preliminary results on HIPAA to shed light on future context-centric privacy research that covers more privacy regulations, social norms, and standards.

Paper Structure (49 sections, 2 equations, 3 figures, 10 tables)

This paper contains 49 sections, 2 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: An example case that transforms a privacy issue into a reasoning problem. Our proposed checklist collects roles, attributes, transmission principles, and annotated legal norms to facilitate the reasoning process.
  • Figure 2: Overview of privacy reasoning within given contexts. Subfigure (a) illustrates previous approaches that use formal languages to determine privacy violations based on rules of inference and axioms. In contrast, subfigure (b) shows our in-context reasoning pipeline, which combines the proposed Privacy Checklist with LLMs.
  • Figure 3: Manual investigation of GPT-4's errors on prohibited cases.