Open Source Software Lifecycle Classification: Developing Wrangling Techniques for Complex Sociotechnical Systems
Wenyi Lu, Enock Kasaadah, S M Rakib Ul Karim, Matt Germonprez, Sean Goggins
TL;DR
This study tackles the problem of differentiating OSS projects by their lifecycle position to support governance and sustainability insights. It adopts a mixed-methods approach, combining CHAOSS-derived sociotechnical metrics with machine learning to classify CNCF-hosted OSS projects into Sandbox, Incubating, and Graduated stages. In the empirical results, the Decision Tree model achieves $90.3\%$ accuracy, with New Contributor Count, Stars Count, Pull Request Average Commits, and Dependency Count emerging as the most influential factors, reflecting intertwined social and technical dimensions. The work provides a practical pathway for lifecycle-aware OSS analysis and proposes genre-based classification as a next step for capturing the diverse, evolving nature of OSS ecosystems.
Abstract
Open source software is a rapidly evolving center for distributed work, and understanding the characteristics of this work across its different contexts is vital for informing policy, economics, and the design of enabling software. The steep increase in open source projects and corporate participation has transformed a peripheral, cottage-industry component of the global technology ecosystem into a large, infinitely complex "technology parts supplier" wired into every corner of contemporary life. The lack of theory and tools for breaking this complexity down into identifiable project types, or of strategies for understanding those types more systematically, is incommensurate with current industry, society, and developer needs. This paper reviews previous attempts to classify open source software and other organizational ecosystems, contrasting open source scientific software ecosystems with those found in corporatized open source software. It then examines the divergent and sometimes conflicting purposes that may exist for classifying open source projects, and how these competing interests impede progress toward a comprehensive understanding of how open source software projects and companies operate. Finally, we present an empirical, mixed-methods study demonstrating how to classify open source projects by their lifecycle position. This study is a first step toward advancing our scientific and practical knowledge of open source software through the lens of dynamic and evolving open source genres. The paper concludes with examples and a proposed path forward.
