Guiding Effort Allocation in Open-Source Software Projects Using Bus Factor Analysis

Aliza Lisan, Boyana Norris

TL;DR

This work analyzes the risk of knowledge concentration in open-source software through the Bus Factor (BF), defined as the minimum number of key developers whose loss would impede progress. It introduces two alternative metrics, LOCC and change-size-cos, integrated with the CST algorithm, and compares them against a git-blame-based RIG algorithm across five HPC GitHub projects, including validation with project maintainers. The study demonstrates that LOCC- and change-size-cos–based CST yields BF estimates closer to maintainers’ expectations than commit-based CST or RIG, while RIG shows non-deterministic and often inflated results for large projects. It also examines BF trends over five years and analyzes directory-level BF to guide hiring and knowledge-transfer strategies, highlighting scalability limitations of existing tools for large repositories and proposing directions for future work and validation using data-driven methods.

Abstract

A critical issue faced by open-source software projects is the risk of key personnel leaving the project. This risk is exacerbated in large projects that have been under development for a long time and have experienced growth in their development teams. One way to quantify this risk is to measure the concentration of knowledge about the project among its developers. This measure is formally known as the Bus Factor (BF) of a project and is defined as 'the number of key developers who would need to be incapacitated to make a project unable to proceed'. Most of the proposed algorithms for BF calculation measure a developer's knowledge of a file based on the number of commits. In this work, we propose using other metrics, namely lines of code changes (LOCC) and cosine difference of lines of code (change-size-cos), to calculate the BF. We use these metrics to calculate the BF for five open-source GitHub projects using the CST algorithm, and we also compute the BF with the RIG algorithm, which is git-blame-based. Moreover, we calculate the BF on project sub-directories that have seen the most active development recently. Lastly, we compare the results of the two algorithms in terms of accuracy, similarity of results, execution time, and trends in BF values over time.
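To make the idea of a LOCC-based bus factor concrete, the following is a minimal sketch of a simple greedy heuristic. It is not the CST or RIG algorithm evaluated in the paper. It assumes a hypothetical input mapping each file to the lines of code changed per developer, treats a developer as knowledgeable about a file when their LOCC is at least `min_ratio` of the top contributor's, and removes the most widespread developer until more than `orphan_limit` of the files have no knowledgeable developer left. The names `knowledgeable_devs`, `bus_factor`, `min_ratio`, `orphan_limit`, and the sample data are all illustrative assumptions.

```python
from collections import Counter


def knowledgeable_devs(locc_by_file, min_ratio=0.5):
    """Map each file to the developers assumed to know it.

    A developer counts as knowledgeable if their changed lines are at
    least `min_ratio` of the top contributor's changed lines (assumption).
    """
    owners = {}
    for path, per_dev in locc_by_file.items():
        top = max(per_dev.values(), default=0)
        owners[path] = {
            dev for dev, lines in per_dev.items()
            if top > 0 and lines >= min_ratio * top
        }
    return owners


def bus_factor(locc_by_file, min_ratio=0.5, orphan_limit=0.5):
    """Greedy estimate: remove the developer who knows the most files
    until more than `orphan_limit` of the files have nobody left."""
    owners = knowledgeable_devs(locc_by_file, min_ratio)
    removed = []
    while True:
        gone = set(removed)
        orphaned = sum(1 for devs in owners.values() if not devs - gone)
        if orphaned > orphan_limit * len(owners):
            return len(removed), removed
        # Count how many files each remaining developer still covers.
        counts = Counter(d for devs in owners.values() for d in devs - gone)
        if not counts:
            return len(removed), removed
        # Sorting first makes ties resolve alphabetically (deterministic).
        top_dev = max(sorted(counts), key=counts.__getitem__)
        removed.append(top_dev)


# Hypothetical per-file LOCC data, for illustration only.
example = {
    "src/solver.c": {"ana": 900, "bo": 100},
    "src/mesh.c": {"ana": 500, "cy": 450},
    "src/io.c": {"bo": 300},
    "docs/guide.md": {"cy": 50, "ana": 40},
}

print(bus_factor(example))  # (2, ['ana', 'cy'])
```

In this toy data, removing `ana` orphans one of four files, and removing `cy` as well orphans three of four, which exceeds the 50% limit, so the estimate is 2. In practice, the per-developer line counts could be collected from repository history, for example with `git log --numstat`.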

Paper Structure

This paper contains 16 sections, 1 equation, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: GReMCat Software Framework.
  • Figure 2: Comparison between LOCC-based CST, change-size-cos-based CST, and RIG bus factors for project directories.
  • Figure 3: Comparison between bus factor values for the combinations of the four CST metrics and the two data metrics.
  • Figure 4: Trend in project-level bus factors over the last five years for LOCC- and change-size-cos-based CST. The yellow bars represent the total number of developers. The dotted lines represent the percentage of BF developers out of the total, plotted against the right y-axis.
  • Figure 5: Trend in directory-level bus factors over the last five years for LOCC- and change-size-cos-based CST. The yellow bars represent the total number of developers. The dotted lines represent the percentage of BF developers out of the total, plotted against the right y-axis.
  • ...and 1 more figure