The whos, whats, and whys of issues related to personal data and data protection in open-source projects on GitHub

Anne Henning, Lukas Schulte, Steffen Herbold, Oksana Kulyk, Peter Mayer

TL;DR

The paper examines how open-source software development on GitHub is shaped by data protection regulations, focusing on how GDPR and, to a lesser extent, CCPA influence issue discussions around personal data. Using a mixed-methods approach, the authors analyze 652 manually validated GitHub issues (from a larger pool of 12,606) through inductive coding and quantitative modeling (multinomial logit, decision trees, random forests) to reveal who reports privacy concerns, how discussions unfold, and what resolutions follow. GDPR emerges as a key driver for reporting activity, with feature enhancements for privacy and consent-related topics being most common, typically addressed by core project members. The findings show that discussions sparked by regulations are effective at initiating privacy-minded changes within OSS projects, and that clear labeling and thorough discussion correlate with higher likelihoods of issue resolution. The study offers a baseline for understanding privacy dynamics in software development and suggests avenues for deeper, cross-language, and time-series analyses.

Abstract

Data protection regulations such as the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act (CCPA) in the US affect how software may handle the personal data of its users. Prior literature focused on how data protection regulations are discussed for software in operation, or how this topic is discussed in various channels outside of the software development process. What is missing is a perspective on the impact of such regulations on the software development process itself. In our work, we address this gap and explore how discussions during the development of software are impacted by regulations, who reports and discusses issues related to personal data and data protection, and how developers react to those issues. To that end, we used inductive coding to analyze 652 issues from open-source GitHub projects and used the codes to quantitatively analyze the relation between the roles, resolutions, and data protection issues to understand correlations and predict resolutions of issues. Most notably, we observed a significant increase in reporting when GDPR came into effect. The most common issue types were feature requests for privacy enhancement, which were mainly reported and discussed by frequent reporters and frequent committers. However, issues regarding privacy enhancement in particular were also frequently reported by one-time reporters. Most of the requests were resolved without opposing votes. All in all, our findings indicate that data protection regulations effectively start discussions about privacy within the software development community.

Paper Structure (41 sections, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Overview of the steps involved in the methodology of this investigation.
  • Figure 2: Issue with only vague information provided.
  • Figure 3: Overview of privacy issues and the counts of how often they were observed.
  • Figure 4: Overview of consent interactions discussed within privacy issues. Omits 452 issues for which the consent interaction was not relevant.
  • Figure 5: Overview of trigger events for the creation of privacy issues.
  • ...and 11 more figures