Recipe for Discovery: A Pipeline for Institutional Open Source Activity

Juanita Gomez, Emily Lovell, Stephanie Lieggi, Alvaro A. Cardenas, James Davis

TL;DR

This paper presents an end-to-end framework for systematically discovering and analyzing open source projects across distributed academic systems. The framework yields actionable insights into institutional open source practices and reveals patterns such as missing licenses or limited community engagement.

Abstract

Open source software development, particularly within institutions such as universities and research laboratories, is often decentralized and difficult to track. Although academic teams produce many impactful scientific tools, their projects do not always follow consistent open source practices, such as clear licensing, documentation, or community engagement. As a result, these efforts often go unrecognized due to limited visibility and institutional awareness, and the software itself can be difficult to sustain over time. This paper presents an end-to-end framework for systematically discovering and analyzing open source projects across distributed academic systems. Using ten universities as a case study, we build a pipeline that collects data via GitHub's REST API, extracts metadata, and predicts both institutional affiliation and project type (e.g., development tools, educational materials, websites, documentation). Applied across the ten campuses, our method identifies over 200,000 repositories and collects information on their activity and open source practices, enabling a deeper understanding of institutional open source contributions. Beyond discovery, our framework enables actionable insights into institutional open source practices, revealing patterns such as missing licenses or limited community engagement. These findings can guide targeted support, policy development, and strategies to strengthen open source contributions across academic institutions.
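
The collection and metadata-extraction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the sample payloads, and the choice of search keyword are hypothetical, and the network request itself is left out. The sketch assumes only the public repository-search endpoint of GitHub's REST API and the repository fields it returns (`full_name`, `language`, `license`, `stargazers_count`, `forks_count`, `pushed_at`).

```python
from urllib.parse import urlencode

# Public repository-search endpoint of GitHub's REST API.
GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keyword: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of repository search results.

    `keyword` is a hypothetical campus-related search term; the paper's
    actual queries are not given in this section.
    """
    params = {"q": keyword, "per_page": per_page, "page": page}
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


def summarize_repo(repo: dict) -> dict:
    """Extract a few metadata fields from one repository payload."""
    # The API returns `license: null` when no license is detected.
    license_info = repo.get("license") or {}
    return {
        "full_name": repo.get("full_name"),
        "language": repo.get("language"),
        "license": license_info.get("spdx_id"),
        "has_license": bool(license_info),
        "stars": repo.get("stargazers_count", 0),
        "forks": repo.get("forks_count", 0),
        "last_push": repo.get("pushed_at"),
    }


def missing_license_rate(repos: list) -> float:
    """Fraction of repositories with no detected license."""
    if not repos:
        return 0.0
    summaries = [summarize_repo(r) for r in repos]
    return sum(not s["has_license"] for s in summaries) / len(summaries)


# Hypothetical payloads shaped like items in a search response.
SAMPLE_REPOS = [
    {
        "full_name": "example-lab/analysis-tool",
        "language": "Python",
        "license": {"spdx_id": "MIT"},
        "stargazers_count": 12,
        "forks_count": 3,
        "pushed_at": "2024-01-15T10:00:00Z",
    },
    {
        "full_name": "example-lab/course-notes",
        "language": None,
        "license": None,
        "stargazers_count": 0,
        "forks_count": 0,
        "pushed_at": "2022-06-01T08:30:00Z",
    },
]

if __name__ == "__main__":
    print(build_search_url("ucsc"))
    print(missing_license_rate(SAMPLE_REPOS))
```

Keeping URL construction and payload parsing separate from the HTTP call lets the extraction logic be checked against saved responses without touching the network. A real collector would also need authentication, pagination, and rate-limit handling, none of which are shown here.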

Paper Structure

This paper contains 29 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Data collection phases with GitHub's REST API.
  • Figure 2: ROC curves for UCSB, UCSC, and UCSD for the five methods evaluated. gpt-5-mini achieves the highest area under the curve (AUC) on all three campuses, indicating strong classification performance.
  • Figure 3: Programming language distribution across all ten University of California campuses. Languages representing less than 2% of total usage were aggregated under “Other.” The bar on the right labeled “Project type” shows the overall distribution of repository types and serves as a baseline for interpreting language usage within each category.
  • Figure 4: Software license usage across all ten University of California campuses. Licenses representing less than 1% of total usage were aggregated under “Other.” The bar on the right labeled “Project type” shows the overall distribution of repository types and serves as a baseline for interpreting license usage within each category.
  • Figure 5: Presence of community files across all ten University of California campuses. The bar on the right labeled “Project type” shows the overall distribution of repository types and serves as a baseline for interpreting community file usage within each category.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2