Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era
Georgia M. Kapitsaki, Maria Papoutsoglou
TL;DR
The paper analyzes how major data privacy laws (GDPR, CCPA/CPRA, UK DPA) influence commit activity on GitHub by assembling a large corpus of privacy-related commits and characterizing their timing, scope, and content. Using automated keyword-based collection combined with manual validation, it shows that GDPR-driven updates dominate, especially around 2018, the year the law came into effect, and that commit messages rarely refer explicitly to specific data privacy user rights. The study highlights the need for better privacy education, automated compliance tooling, and verification of actual code-level compliance rather than reliance on textual indicators alone. The resulting dataset and insights offer a foundation for privacy-aware software engineering research and tooling development.
Abstract
Free and open source software has gained considerable momentum in industry and the research community. Recent advances in privacy legislation, including the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), have forced the community to pay special attention to users' data privacy. The main aim of this work is to examine software repositories that are acting on privacy laws. We collected commit data from GitHub repositories to identify references to the main data privacy laws (GDPR, CCPA, CPRA, UK DPA) in recent years. Via an automated process, we analyzed 37,213 commits made since 2016 in 12,391 repositories, and we manually analyzed 594 commits from the 70 most popular repositories of the dataset. We observe that most commits were made in the year the law came into effect and that privacy-relevant terms appear in the commit messages, whereas references to specific data privacy user rights are scarce. The study shows that more educational activities on data privacy user rights are needed, as well as tools for privacy recommendations, and that verifying actual compliance via source code execution is a useful direction for software engineering researchers.
