Automated Detection and Analysis of Data Practices Using A Real-World Corpus
Mukund Srinath, Pranav Venkit, Maria Badillo, Florian Schaub, C. Lee Giles, Shomir Wilson
TL;DR
This work tackles the challenge of making privacy policies more usable by automatically mapping policy excerpts to predefined data-practice descriptions derived from the crowd-sourced ToS;DR dataset. It introduces an automated pipeline that uses two sampling strategies, RS and CBS, to train PrivBERT for binary matching between policy excerpts and data-practice cases, achieving state-of-the-art results and demonstrating robustness in a real-world case study (Airbnb). The proposed multi-level privacy label presents data practices at varying levels of granularity, enabling scalable, user-friendly summaries that help users understand what happens to their data. Overall, the approach advances usable privacy by combining crowd-sourced annotations, targeted sampling strategies, and a privacy-labeling interface suitable for deployment in consumer contexts.
Abstract
Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.
