Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era

Georgia M. Kapitsaki, Maria Papoutsoglou

TL;DR

The paper analyzes how major data privacy laws (GDPR, CCPA/CPRA, UK DPA) influence commit activity on GitHub. It assembles a large corpus of privacy-related commits and characterizes their timing, scope, and content. Combining automated keyword-based collection with manual validation, it shows that GDPR-related commits dominate, especially around 2018, the year the law came into effect, while commit messages rarely refer explicitly to specific data privacy user rights. The study highlights the need for better privacy education, automated compliance tooling, and verification of actual code-level compliance rather than reliance on textual indicators alone. The resulting dataset and insights offer a foundation for privacy-aware software engineering research and tooling development.

Abstract

Free and open source software has gained a lot of momentum in industry and the research community. The latest advances in privacy legislation, including the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), have forced the community to pay special attention to users' data privacy. The main aim of this work is to examine software repositories that are acting on privacy laws. We collected commit data from GitHub repositories in order to understand how the main data privacy laws (GDPR, CCPA, CPRA, UK DPA) have been reflected in them in recent years. Via an automated process, we analyzed 37,213 commits from 12,391 repositories since 2016, and we manually analyzed 594 commits from the 70 most popular repositories of the dataset. We observe that most commits were made in the year the law came into effect and that privacy-relevant terms appear in the commit messages, whereas references to specific data privacy user rights are scarce. The study shows that more educational activities on data privacy user rights are needed, as well as tools for privacy recommendations, and that verifying actual compliance via source code execution is a useful direction for software engineering researchers.

Paper Structure

This paper contains 19 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Methodological steps.
  • Figure 2: Privacy laws appearing over the years, by number of commits.
  • Figure 3: Number of days between first and last commit in a repository.
  • Figure 4: Frequency of privacy law relevant commits in repositories.
  • Figure 5: Word clouds from text of commit messages.