PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research

Nick Oh, Giorgos D. Vrakas, Siân J. M. Brooke, Sasha Morinière, Toju Duke

TL;DR

PETLP addresses the fragmented governance of social media data by embedding privacy, intellectual property (IP), and contractual considerations into a four-phase, ETL-like pipeline, with a Data Protection Impact Assessment (DPIA) maintained as a living document across the research lifecycle. It provides practical decision trees and a Reddit case study showing how platform terms, copyright, and the GDPR interact during data extraction, transformation, loading, and dissemination. The framework highlights the difficulty of true anonymisation, the risks of model distribution, and the need for continuous compliance, and it offers concrete mechanisms (e.g., differential privacy (DP), living DPIA documents, and platform-aware extraction channels) to support responsible AI research at scale. By reframing compliance as a design principle rather than a post hoc check, PETLP aims to strengthen research reproducibility, public trust, and regulatory alignment in social media AI work.

Abstract

Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms, yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through a systematic analysis of Reddit, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We also show why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.
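To make the idea of a living DPIA concrete, the following is a minimal Python sketch of a four-phase pipeline that revises one DPIA record at every phase. All names here (`Phase`, `LivingDPIA`, `extraction_basis`, `run_pipeline`) and the wording of the notes are hypothetical illustrations, not part of PETLP itself. The extraction rule is a deliberately coarse reading of the abstract's distinction between qualifying research organisations and commercial entities.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """The four PETLP phases named in the abstract."""
    EXTRACT = "extract"
    TRANSFORM = "transform"
    LOAD = "load"
    PRESENT = "present"


@dataclass
class LivingDPIA:
    """Hypothetical DPIA record that is revised at each phase, not written once."""
    entries: list = field(default_factory=list)

    def revise(self, phase: Phase, note: str) -> None:
        # Append a (phase, note) pair so the assessment history is preserved.
        self.entries.append((phase, note))

    def phases_covered(self) -> set:
        return {phase for phase, _ in self.entries}


def extraction_basis(is_qualifying_research_org: bool) -> str:
    """Simplified assumption: research organisations may rely on DSM Article 3,
    while commercial entities remain bound by the platform's terms of service."""
    if is_qualifying_research_org:
        return "DSM Article 3"
    return "platform terms of service"


def run_pipeline(is_qualifying_research_org: bool) -> LivingDPIA:
    """Walk the four phases, revising the DPIA at each one (illustrative notes)."""
    dpia = LivingDPIA()
    basis = extraction_basis(is_qualifying_research_org)
    # GDPR obligations are recorded regardless of who is extracting.
    dpia.revise(Phase.EXTRACT, f"extraction basis: {basis}; GDPR applies")
    dpia.revise(Phase.TRANSFORM, "data still treated as personal data after processing")
    dpia.revise(Phase.LOAD, "storage and access decisions recorded")
    dpia.revise(Phase.PRESENT, "dataset release and model distribution assessed separately")
    return dpia


if __name__ == "__main__":
    for phase, note in run_pipeline(is_qualifying_research_org=True).entries:
        print(f"{phase.value}: {note}")
```

Keeping the DPIA as an append-only list of phase-tagged notes is one simple way to show that the assessment evolves with the project. A real workflow would attach far richer evidence at each step.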

Paper Structure

This paper contains 98 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Privacy-by-design ETLP (PETLP) framework for social media AI research. Pipeline stages (Appendix D, Figures 2-10) and Reddit case study (Appendix E).
  • Figure 2: Determining Controller Relationships in AI Research Projects. Identifying joint controllers, processors, and independent controllers for GDPR compliance.
  • Figure 3: GDPR Legal Basis Selection for Social Media Research. Navigating consent, legitimate interest, and public task for AI research projects.
  • Figure 4: Research Organisation Qualification Under DSM Article 3. Determining eligibility for enhanced text and data mining rights.
  • Figure 5: Platform Restrictions versus Legal Rights in AI Research. When terms of service conflict with statutory research exemptions.
  • ...and 6 more figures