Exploring User Privacy Awareness on GitHub: An Empirical Study

Costanza Alfieri, Juri Di Rocco, Paola Inverardi, Phuong T. Nguyen

TL;DR

The paper addresses the problem that GitHub's privacy settings, though designed to protect users, may not fully safeguard personal information disclosed in day-to-day activity. It presents an empirical study that uses GHTorrent data and API updates to build two datasets (Users and Active users) and a 15,672-comment PRIVACY-labeled corpus drawn from pull request discussions, in order to examine how privacy settings are adopted and how sensitive information is disclosed. The authors identify four distinct privacy profiles among users and find frequent self-disclosures that profile-visibility controls do not cover, which suggests a privacy paradox or privacy fatigue in practice. They also explore automated detection of sensitive comments with fine-tuned Llama2 and BERT models, finding that fine-tuned models substantially outperform zero-shot approaches, and lay the groundwork for a privacy-aware assistant that could guide users toward more privacy-consistent comments.

Abstract

GitHub provides developers with a practical way to distribute source code and collaboratively work on common projects. To enhance account security and privacy, GitHub allows its users to manage access permissions, review audit logs, and enable two-factor authentication. However, despite the endless effort, the platform still faces various issues related to the privacy of its users. This paper presents an empirical study delving into the GitHub ecosystem. Our focus is on investigating the utilization of privacy settings on the platform and identifying various types of sensitive information disclosed by users. Leveraging a dataset comprising 6,132 developers, we report and analyze their activities by means of comments on pull requests. Our findings indicate an active engagement by users with the available privacy settings on GitHub. Notably, we observe the disclosure of different forms of private information within pull request comments. This observation has prompted our exploration into sensitivity detection using a large language model and BERT, to pave the way for a personalized privacy assistant. Our work provides insights into the utilization of existing privacy protection tools, such as privacy settings, along with their inherent limitations. Essentially, we aim to advance research in this field by providing both the motivation for creating such privacy protection tools and a proposed methodology for personalizing them.

Paper Structure (36 sections, 4 equations, 15 figures, 11 tables)

Figures (15)

  • Figure 1: GitHub Privacy Statement on settings.
  • Figure 2: Privacy settings on GitHub.
  • Figure 3: Examples of companies requiring the GitHub profile.
  • Figure 4: Workflow of the study.
  • Figure 5: Correlation matrix of the variables in the Users dataset.
  • ...and 10 more figures