Natural Language Processing of Privacy Policies: A Survey
Andrick Adhikari, Sanchari Das, Rinku Dewri
TL;DR
This survey addresses a pressing problem: privacy policies are lengthy and hard to understand, which hinders users' privacy decisions. It reviews 109 NLP papers to map how information retrieval, summarization, question answering (QA), classification, and alignment have been applied to privacy texts, and it analyzes the role of privacy-policy corpora such as OPP-115, PPCRAWL, and PrivaSeer. Major contributions include a taxonomy of NLP tasks in the privacy domain, an assessment of domain-specific embeddings (e.g., Polisis), and a discussion of LLM-enabled approaches (e.g., PolicyGPT) alongside challenges such as limited practicality and high computational cost. The work provides a roadmap for developing unified, user-centered privacy-policy tools that can operate across domains and regulatory regimes, emphasizing corpus creation, domain adaptation, and multidimensional outputs to support real-world usability.
Abstract
Natural Language Processing (NLP) is an essential subset of artificial intelligence. It has proven effective in several domains, such as healthcare, finance, and media, for identifying perceptions, opinions, and misuse, among others. Privacy is no exception, and initiatives have been taken to address the challenges of delivering usable privacy notifications to users with the help of NLP. To this end, we conduct a literature review by analyzing 109 papers at the intersection of NLP and privacy policies. First, we provide a brief introduction to privacy policies and discuss various facets of the associated problems, which necessitate the application of NLP to improve the current state of privacy notices and disclosures to users. Subsequently, we a) provide an overview of the implementation and effectiveness of NLP approaches for better privacy policy communication; b) identify the methodologies that can be further enhanced to provide robust privacy policies; and c) identify the gaps in the current state-of-the-art research. Our systematic analysis reveals that several research papers focus on annotating and classifying privacy texts for analysis but do not adequately address other aspects of NLP applications, such as summarization. More specifically, ample research opportunities exist in this domain, covering aspects such as corpus generation, summarization vectors, contextualized word embedding, identification of privacy-relevant statement categories, fine-grained classification, and domain-specific model tuning.
