Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies
Mukund Srinath, Shomir Wilson, C. Lee Giles
TL;DR
This work addresses the scarcity of large-scale privacy policy data and the need for NLP tools to help users understand privacy notices. It introduces PrivaSeer, a web-scale corpus of over 1 million English privacy policies gathered through a focused crawling and filtering pipeline, with deduplication and content extraction, yielding 1,005,380 policies from 995,475 domains. The authors train PrivBERT, a RoBERTa-based in-domain model, and demonstrate state-of-the-art performance on data-practice classification and privacy-question answering tasks, while also providing broad analyses of readability and topic distributions at web scale. By releasing the corpus, a search tool, the collection pipeline, and PrivBERT, the work aims to advance research in automatic privacy-policy understanding, with implications for users, regulators, and researchers.
Abstract
Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users often care about their digital privacy, they frequently do not read privacy policies, since reading them requires a significant investment of time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English-language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline that consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus and show results from readability tests, document similarity, and keyphrase extraction, and we explore the corpus through topic modeling.
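To make the filtering stages of such a pipeline concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a keyword heuristic (`looks_like_policy`, a hypothetical stand-in for a trained document classifier), exact-duplicate removal by hashing normalized text, and near-duplicate removal by Jaccard similarity over word shingles with an illustrative threshold of 0.8. Crawling, language detection, and content extraction are omitted because they depend on external tools.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, k=3):
    """Return the set of k-word shingles of the normalized document."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_policy(text):
    # Hypothetical keyword heuristic standing in for a trained classifier.
    lowered = text.lower()
    return "privacy" in lowered and ("information" in lowered or "data" in lowered)


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates (by hash) and near-duplicates (by shingle overlap)."""
    seen_hashes = set()
    kept = []  # list of (original text, shingle set)
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue  # near-duplicate of a document already kept
        kept.append((doc, doc_shingles))
    return [text for text, _ in kept]


def filter_pipeline(docs, threshold=0.8):
    """Classify candidate documents, then remove duplicates and near-duplicates."""
    candidates = [d for d in docs if looks_like_policy(d)]
    return deduplicate(candidates, threshold)
```

The pairwise comparison in `deduplicate` is quadratic in the number of kept documents, so it only suits small samples. At the scale of a million documents, an approximate scheme such as MinHash with locality-sensitive hashing would be a more plausible choice.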
