Open Source Software Lifecycle Classification: Developing Wrangling Techniques for Complex Sociotechnical Systems

Wenyi Lu, Enock Kasaadah, S M Rakib Ul Karim, Matt Germonprez, Sean Goggins

TL;DR

This study tackles the problem of differentiating OSS projects by their lifecycle position to support governance and sustainability insights. It adopts a mixed-methods approach, combining CHAOSS-derived sociotechnical metrics with machine learning to classify CNCF-hosted OSS projects into Sandbox, Incubating, and Graduated stages. The empirical results show the Decision Tree model achieving $90.3\%$ accuracy, with New Contributor Count, Stars Count, Pull Request Average Commits, and Dependency Count emerging as the most influential factors, reflecting intertwined social and technical dimensions. The work provides a practical pathway for lifecycle-aware OSS analysis and suggests paths forward through genre-based classification to capture the diverse, evolving nature of OSS ecosystems.
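The classification step summarized above can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature names are hypothetical labels for the four metrics named in the summary, the rows are made-up toy values rather than CNCF data, and the tree is a small hand-written Gini-based learner standing in for whatever Decision Tree implementation the study used.

```python
from collections import Counter

# Hypothetical feature order, named after the four metrics in the summary.
FEATURES = ["new_contributor_count", "stars_count",
            "pr_average_commits", "dependency_count"]

# Made-up toy rows for illustration only; these are NOT data from the study.
ROWS = [
    ([2, 40, 1.2, 15], "Sandbox"),
    ([3, 90, 1.5, 22], "Sandbox"),
    ([1, 25, 1.1, 10], "Sandbox"),
    ([8, 900, 2.4, 60], "Incubating"),
    ([10, 1500, 2.1, 75], "Incubating"),
    ([7, 700, 2.8, 55], "Incubating"),
    ([30, 12000, 3.5, 140], "Graduated"),
    ([25, 9000, 3.9, 120], "Graduated"),
    ([40, 20000, 3.2, 160], "Graduated"),
]


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return (feature_index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for f in range(len(rows[0][0])):
        values = sorted({x[f] for x, _ in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build(rows, depth=0, max_depth=3):
    """Grow a tree: leaves are class labels, inner nodes are tuples."""
    split = best_split(rows) if depth < max_depth else None
    if split is None:
        # Leaf: predict the majority class of the rows that reached it.
        return Counter(y for _, y in rows).most_common(1)[0][0]
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, build(left, depth + 1, max_depth),
            build(right, depth + 1, max_depth))


def predict(tree, x):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


tree = build(ROWS)
```

Calling `predict(tree, [9, 1000, 2.5, 65])` on this toy tree routes a hypothetical project to one of the three stage labels. A shallow tree is used here because its splits can be read directly as thresholds on individual metrics, which is one reason tree models suit questions about which factors separate lifecycle stages.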

Abstract

Open source software is a rapidly evolving center for distributed work, and understanding the characteristics of this work across its different contexts is vital for informing policy, economics, and the design of enabling software. The steep increase in open source projects and corporate participation has transformed a peripheral, cottage-industry component of the global technology ecosystem into a large, infinitely complex "technology parts supplier" wired into every corner of contemporary life. The lack of theory and tools for breaking this complexity down into identifiable project types, or of strategies for understanding them more systematically, is incommensurate with current industry, society, and developer needs. This paper reviews previous attempts to classify open source software and other organizational ecosystems, contrasting open source scientific software ecosystems with those found in corporatized open source software. It then examines the divergent and sometimes conflicting purposes for classifying open source projects, and how these competing interests impede progress toward a comprehensive understanding of how open source software projects and companies operate. Finally, we present an empirical, mixed-methods study demonstrating how to classify open source projects by their lifecycle position. This is a first step toward advancing our scientific and practical knowledge of open source software through the lens of dynamic and evolving open source genres. The paper concludes with examples and a proposed path forward.

Paper Structure

This paper contains 41 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Ridgeline plots showing how the quartile distribution of CNCF projects, for each of our model's four most significant metrics, distinguishes between the three classes. In each plot, "sandbox" projects are at the top, "incubating" projects are in the middle, and "grads" (graduated projects) are at the bottom.
  • Figure 2: Ridge line plots for the seven factors in the model for classifying projects according to their lifecycle stage that were not discussed in the core narrative of the paper.