
Extractive text summarisation of Privacy Policy documents using machine learning approaches

Chanwoo Choi

TL;DR

The paper tackles the challenge of making Privacy Policy (PP) documents easier to understand by extracting sentence-level summaries aligned with GDPR topics. It compares two clustering-based extractive summarisation strategies, K-means and Pre-determined Centroid (PDC) clustering, both using SBERT-based sentence embeddings and PCA for dimensionality reduction. Evaluations using Sum of Squared Distance (SSD) and ROUGE show that the PDC approach, which uses fixed GDPR-topic centroids, outperforms K-means by about 27% on SSD and 24% on ROUGE, demonstrating the value of task-specific fine-tuning in unsupervised settings. The work highlights practical potential for GDPR-compliance screening of PP documents and sets directions for enhancing gold-standard annotations and extending the approach to other data privacy legislation.

Abstract

This work demonstrates two Privacy Policy (PP) summarisation models based on two different clustering algorithms: K-means clustering and Pre-determined Centroid (PDC) clustering. K-means was chosen for the first model after an extensive evaluation of ten commonly used clustering algorithms. The PDC-based summariser summarises PP documents by grouping individual sentences according to the Euclidean distance from each sentence to the pre-defined cluster centres. The cluster centres are defined according to the 14 essential topics that the General Data Protection Regulation (GDPR) requires in any privacy notice. The PDC model outperformed the K-means model on both evaluation methods, Sum of Squared Distance (SSD) and ROUGE, by 27% and 24% respectively. This result contrasts with the K-means model's better performance in the general clustering of sentence vectors before the task-specific evaluation was run. This indicates the effectiveness of applying task-specific fine-tuning measures to unsupervised machine-learning models. The summarisation mechanisms implemented in this paper demonstrate how to efficiently extract the essential sentences that should be included in any PP document. The summariser models could be further developed into an application that tests the compliance of PP documents with the GDPR (or any other data privacy legislation).
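
The PDC idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy 2-D vectors stand in for SBERT sentence embeddings after PCA, the two centroids stand in for the 14 GDPR-topic centres, and the function names are hypothetical. Picking the sentence closest to each centre as the summary sentence is also an assumption made here for illustration.

```python
import numpy as np

def assign_to_centroids(sentence_vecs, centroids):
    """Assign each sentence vector to its nearest fixed centroid (Euclidean)."""
    # Pairwise distances, shape (n_sentences, n_centroids).
    diffs = sentence_vecs[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    labels = dists.argmin(axis=1)
    return labels, dists

def sum_of_squared_distance(dists, labels):
    """SSD: sum of squared distances from each vector to its assigned centroid."""
    nearest = dists[np.arange(len(labels)), labels]
    return float(np.sum(nearest ** 2))

def pick_summary(dists, labels):
    """For each non-empty cluster, return the index of the sentence closest
    to that cluster's centroid (an assumed selection rule)."""
    summary = {}
    for k in np.unique(labels):
        members = np.where(labels == k)[0]
        summary[int(k)] = int(members[dists[members, k].argmin()])
    return summary

# Toy stand-ins: two fixed "topic" centroids and four "sentence" vectors.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
sentence_vecs = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0], [10.0, 12.0]])

labels, dists = assign_to_centroids(sentence_vecs, centroids)
ssd = sum_of_squared_distance(dists, labels)
summary = pick_summary(dists, labels)
print(labels, ssd, summary)
```

Unlike K-means, the centroids here are never updated, so the assignment is a single pass and each cluster keeps a fixed topical meaning. A lower SSD indicates that sentences sit closer to the centres they were assigned to.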

Paper Structure (32 sections, 9 figures, 10 tables)

This paper contains 32 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: The demonstration of "Worth reading finder" as illustrated in my last work choi2020worth
  • Figure 2: The model architecture of a basic transformer as introduced in vaswani2017attention
  • Figure 3: Explained variance (left) and Silhouette score (right) by the number of principal components.
  • Figure 4: Comparison of clustering results made by K-means and Affinity.
  • Figure 5: Comparison of clustering results generated by the PDC-clustering algorithm with two separate n_comp values 3 and 100.
  • ...and 4 more figures