Who Do You Think You Are? Creating RSE Personas from GitHub Interactions

Felicity Anderson, Julien Sindt, Neil Chue Hong

TL;DR

This paper develops data-driven RSE Personas by mining GitHub interactions from a large set of open RS repositories, identifying seven distinct contributor patterns (from Ephemeral to Active Contributors) across 115,174 repo-individuals in 1,284 RS repositories. It combines hierarchical clustering with PCA validation, using interaction types (Commit, Issue, and Pull Request activities) and their volumes to group contributors, while highlighting the pivotal role of PR Closure and Issue Assignment in distinguishing personas. The study demonstrates the feasibility of scalable, data-driven persona derivation in diverse RS contexts and discusses limitations such as platform bias, bot influence, and the need for richer interaction signals. Practical impact includes providing RS project teams with tangible personas to inform credit attribution, workload planning, and targeted team development, alongside a foundation for ongoing persona dynamics research and cross-platform validation.
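The clustering step summarised above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the four feature columns, the toy counts, the log transform and the choice of Ward linkage are all assumptions made for illustration. It groups hypothetical repo-individuals by interaction volume using agglomerative (hierarchical) clustering, then compares candidate cluster counts with the Calinski-Harabasz (CH) index.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(rows, k):
    """Naive agglomerative clustering: start with singletons and repeatedly
    merge the pair with the smallest Ward cost until k clusters remain.
    Returns a list of clusters, each a list of row indices."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pa = [rows[i] for i in clusters[a]]
            pb = [rows[i] for i in clusters[b]]
            na, nb = len(pa), len(pb)
            # Ward cost: growth in within-cluster sum of squares after merging.
            cost = na * nb / (na + nb) * sq_dist(centroid(pa), centroid(pb))
            if best is None or cost < best[0]:
                best = (cost, a, b)
        _, a, b = best  # a < b is guaranteed by combinations()
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def calinski_harabasz(rows, clusters):
    """CH index: between-cluster dispersion over within-cluster dispersion,
    each scaled by its degrees of freedom. Higher means better separated."""
    n, k = len(rows), len(clusters)
    overall = centroid(rows)
    between = within = 0.0
    for members in clusters:
        pts = [rows[i] for i in members]
        c = centroid(pts)
        between += len(pts) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in pts)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical per-individual counts (made-up data, not from the paper):
# [commits, issues created, PRs closed, issues assigned]
raw = [
    [1, 0, 0, 0], [2, 1, 0, 0], [1, 1, 0, 0],        # low interactivity
    [40, 10, 2, 1], [55, 12, 3, 2], [48, 8, 1, 1],   # moderate interactivity
    [900, 150, 300, 80], [1100, 200, 350, 120],      # high interactivity
]
# Assumption: log-scale the heavy-tailed counts before clustering.
rows = [[math.log1p(v) for v in r] for r in raw]

scores = {k: calinski_harabasz(rows, ward_clusters(rows, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(best_k, sorted(sorted(c) for c in ward_clusters(rows, best_k)))
```

On this toy data the three-cluster cut scores highest, matching the three activity levels built into the rows. The pairwise search is far too slow for a sample of the size reported here; it is kept only because it makes the merge criterion explicit.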

Abstract

We describe data-driven RSE personas: an approach combining software repository mining and data-driven personas applied to research software (RS), an attempt to describe and identify common and rare patterns of Research Software Engineering (RSE) development. This allows individuals and RS project teams to understand their contributions, impact and repository dynamics - an important foundation for improving RSE. We evaluate the method on different patterns of collaborative interaction behaviours by contributors to mid-sized public RS repositories (those with 10-300 committers) on GitHub. We demonstrate how the RSE personas method successfully characterises a sample of 115,174 repository contributors across 1,284 RS repositories on GitHub, sampled from 42,284 candidate software repository records queried from Zenodo. We identify, name and summarise seven distinct personas from low to high interactivity: Ephemeral Contributor; Occasional Contributor; Project Organiser; Moderate Contributor; Low-Process Closer; Low-Coding Closer; and Active Contributor. This demonstrates that large datasets can be analysed despite difficulties of comparing software projects with different project management factors, research domains and contributor backgrounds.

Paper Structure

This paper contains 58 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Data collection and sampling workflow; evaluation methods, including calculating the CH Index and visualisation with a dendrogram.
  • Figure 2: Repository sizes (in numbers of contributors) at population and sample levels.
  • Figure 3: Most repo-individuals are from low-interactivity groupings, while most repositories include individuals from at least two different interactivity groups.
  • Figure 4: Principal Component Analysis results show the greatest separation on the first axis, which explains 81.36% of variance. Subsequent axes add only 10.12% of variance, bringing the total to 91.44%. The Low Interactivity Grouping (Cluster 2, yellow) shows the tightest clustering and overlaps significantly with the Moderate Interactivity Grouping (Cluster 0, red).
  • Figure 5: Key variables that split the initial clusters and sub-clusters and describe differences between RSE Personas, together with the proportion of the sample occupied by each persona.
  • ...and 5 more figures