Demystifying Legalese: An Automated Approach for Summarizing and Analyzing Overlaps in Privacy Policies and Terms of Service
Shikha Soneji, Mitchell Hoesing, Sujay Koujalgi, Jonathan Dodge
TL;DR
This work tackles the readability gap in Terms of Service (ToS) and Privacy Policy documents by building an automated, scoring-based analysis framework on the ToSDR taxonomy. It compares transformer models (RoBERTa, PrivBERT) with traditional learners on two tasks: case concept classification (246 labels) and document-type classification (5 labels). RoBERTa is the strongest performer, with an F1 of about 0.74 on the primary task. The study introduces a dual overlap analysis that measures concept encroachment across ToS and privacy documents via pairwise accuracies and a distribution-based loss, and it finds meaningful, though not extreme, concept overlap. Practically, the approach supports scalable, interpretable policy analysis and highlights where GDPR-related redundancies occur, informing more transparent and compliant document drafting.
Abstract
The complexity of legalese in terms and policy documents can bind individuals to contracts they do not fully comprehend, potentially leading to uninformed data sharing. Our work seeks to alleviate this issue by developing language models that provide automated, accessible summaries and scores for such documents, with the aim of improving user understanding and supporting informed decisions. We compared transformer-based and conventional models trained on our dataset; RoBERTa performed best overall, with an F1-score of 0.74. Using RoBERTa, we highlighted redundancies and potential guideline violations by identifying overlaps in GDPR-required documents, underscoring the need for stricter GDPR compliance.
