Guiding Effort Allocation in Open-Source Software Projects Using Bus Factor Analysis
Aliza Lisan, Boyana Norris
TL;DR
This work analyzes the risk of knowledge concentration in open-source software through the Bus Factor (BF), defined as the minimum number of key developers whose loss would impede progress. It introduces two alternative metrics, LOCC and change-size-cos, integrated with the CST algorithm, and compares them against a git-blame-based RIG algorithm across five HPC GitHub projects, including validation with project maintainers. The study demonstrates that LOCC- and change-size-cos–based CST yields BF estimates closer to maintainers’ expectations than commit-based CST or RIG, while RIG shows non-deterministic and often inflated results for large projects. It also examines BF trends over five years and analyzes directory-level BF to guide hiring and knowledge-transfer strategies, highlighting scalability limitations of existing tools for large repositories and proposing directions for future work and validation using data-driven methods.
Abstract
A critical issue faced by open-source software projects is the risk of key personnel leaving the project. This risk is exacerbated in large projects that have been under development for a long time and whose development teams have grown. One way to quantify this risk is to measure the concentration of knowledge about the project among its developers. This measure is formally known as the Bus Factor (BF) of a project and is defined as 'the number of key developers who would need to be incapacitated to make a project unable to proceed'. Most of the proposed algorithms for BF calculation measure a developer's knowledge of a file based on the number of commits. In this work, we propose using other metrics, such as lines of code changes (LOCC) and cosine difference of lines of code (change-size-cos), to calculate the BF. We use these metrics to calculate the BF of five open-source GitHub projects using the CST algorithm and the RIG algorithm, which is git-blame-based. Moreover, we calculate the BF on project sub-directories that have seen the most active development recently. Lastly, we compare the results of the two algorithms in terms of accuracy, similarity of results, execution time, and trends in BF values over time.
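To make the BF definition concrete, the following is a minimal sketch of how a bus factor could be estimated from LOCC data. It is a simplified greedy heuristic written for illustration only, not the CST or RIG algorithm. The function names, the input format of (developer, file, lines changed) tuples, the `min_share` cutoff of 0.3, and the `orphan_limit` of 0.5 are all assumptions introduced here.

```python
from collections import defaultdict


def key_developers(changes, min_share=0.3):
    """Map each file to the developers holding at least `min_share` of its LOCC.

    `changes` is an iterable of (developer, file, lines_changed) tuples.
    The 0.3 cutoff is an illustrative assumption, not a value from the paper.
    """
    locc = defaultdict(lambda: defaultdict(int))
    for dev, path, lines_changed in changes:
        locc[path][dev] += lines_changed

    keys = {}
    for path, per_dev in locc.items():
        total = sum(per_dev.values())
        keys[path] = {d for d, n in per_dev.items() if total and n / total >= min_share}
    return keys


def bus_factor(changes, min_share=0.3, orphan_limit=0.5):
    """Greedily remove developers until more than `orphan_limit` of files are orphaned.

    A file counts as orphaned once none of its key developers remain.
    Returns (bus_factor, removed_developers_in_order).
    """
    keys = key_developers(changes, min_share)
    n_files = len(keys)
    removed = []
    while n_files:
        gone = set(removed)
        orphaned = sum(1 for devs in keys.values() if not devs - gone)
        if orphaned / n_files > orphan_limit:
            break
        # Count, for each remaining developer, the files they are key on.
        counts = defaultdict(int)
        for devs in keys.values():
            for d in devs - gone:
                counts[d] += 1
        if not counts:
            break
        # Remove the developer covering the most files (ties broken by name).
        removed.append(max(sorted(counts), key=counts.get))
    return len(removed), removed


if __name__ == "__main__":
    sample = [
        ("alice", "a.c", 90), ("bob", "a.c", 10),
        ("alice", "b.c", 50), ("bob", "b.c", 50),
        ("carol", "c.c", 100),
        ("alice", "d.c", 100),
    ]
    print(bus_factor(sample))
```

In this toy input, removing alice orphans two of the four files, which does not yet exceed the 50% limit; removing bob as well orphans a third file, so the sketch reports a bus factor of 2. A commit-based variant would count commits instead of summing `lines_changed`, which is the kind of substitution the abstract contrasts.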
