Demystifying Legalese: An Automated Approach for Summarizing and Analyzing Overlaps in Privacy Policies and Terms of Service

Shikha Soneji; Mitchell Hoesing; Sujay Koujalgi; Jonathan Dodge

TL;DR

This work tackles the readability gap in Terms of Service (ToS) and Privacy Policy documents by building an automated, scoring-based analysis framework on the ToSDR taxonomy. It compares transformer models (RoBERTa, PrivBERT) with traditional learners on two tasks: case concept classification (246 labels) and document-type classification (5 labels). RoBERTa performs best, with an F1-score of about 0.74 on the primary task. The study also introduces a dual overlap analysis that measures how far concepts encroach across ToS and privacy documents, using pairwise accuracies and a distribution-based loss, and finds meaningful, though not extreme, concept overlap. Practically, the approach supports scalable, interpretable policy analysis and highlights where redundancies occur in GDPR-required documents, which can inform more transparent and compliant document drafting.

Abstract

The complexities of legalese in terms and policy documents can bind individuals to contracts they do not fully comprehend, potentially leading to uninformed data sharing. Our work seeks to alleviate this issue by developing language models that provide automated, accessible summaries and scores for such documents, aiming to enhance user understanding and facilitate informed decisions. We compared transformer-based and conventional models trained on our dataset; RoBERTa performed best overall, with an F1-score of 0.74. Using RoBERTa, our best-performing model, we highlighted redundancies and potential guideline violations by identifying overlaps in GDPR-required documents, underscoring the necessity for stricter GDPR compliance.

Paper Structure (19 sections, 1 equation, 5 figures, 3 tables)

Introduction
Background and Related Work
People's Understanding of Legal Documents
Mining and Annotation of Legal Documents
Related Work
Methodology
Stage 1: Data Collection
Stage 2: Data Preprocessing
Stage 3: Train Machine Learning Models
Stage 4: Evaluate Models and Measure Overlap
Empirical Results
RQ1 - Case Prediction
RQ2 - Overlap Quantification
RQ3 - Cases that Overlap
Discussion
...and 4 more sections

Figures (5)

Figure 1: Data Dissection
Figure 2: Distribution of examples over our labels for (left) Cases, consisting of 246 labels with frequencies ranging from 575 down to 1; and (right) DocTypes, consisting of 5 labels with frequencies ranging from 6409 down to 115.
Figure 3: Overlapping cases where most instances appear in a Privacy Policy. Cases that we labelled as "Privacy Related" are marked with a red $\circ$ (4 out of 7 cases).
Figure 4: Overlapping cases where most instances appear in a Terms of Service. Cases that we labelled as "Privacy Related" are marked with a red $\circ$ (4 out of 14 cases).
Figure 5: Frequency of overlapping cases between Privacy Policy and Terms of Service. Cases that we labelled as "Privacy Related" are marked with a red $\circ$ (2 out of 14 cases).
