Sources of Underproduction in Open Source Software

Kaylea Champion, Benjamin Mako Hill

TL;DR

The paper examines underproduction in open source software by testing social and technical correlates within the Debian packaging ecosystem, using Champion and Hill's underproduction measure and a suite of predictors. It fits four logistic regression models relating package age, language age, contributor activity, maintainer dynamics, team organization, and collaboration-network metrics to underproduction. Older software and older languages are associated with higher risk, and higher contributor counts are associated with higher, not lower, risk. Maintenance by a declared team offers little protection in the full model, contributors who are central in the bug collaboration network are more involved with underproduced packages, and betweenness centrality shows no clear association. The findings underscore the complexity of aligning supply with user demand in FLOSS and suggest practitioners should focus on stable, dedicated maintenance and cross-package visibility rather than purely expanding contributor counts or relying on team-based approaches.

Abstract

Because open source software relies on individuals who select their own tasks, it is often underproduced -- a term used by software engineering researchers to describe when a piece of software's relative quality is lower than its relative importance. We examine the social and technical factors associated with underproduction through a comparison of software packaged by the Debian GNU/Linux community. We test a series of hypotheses developed from a reading of prior research in software engineering. Although we find that software age and programming language age offer a partial explanation for variation in underproduction, we were surprised to find that the association between underproduction and package age is weaker at high levels of programming language age. With respect to maintenance efforts, we find that additional resources are not always tied to better outcomes. In particular, having higher numbers of contributors is associated with higher underproduction risk. Also, contrary to our expectations, maintainer turnover and maintenance by a declared team are not associated with lower rates of underproduction. Finally, we find that the people working on bugs in underproduced packages tend to be those who are more central to the community's collaboration network structure, although contributors' betweenness centrality (often associated with brokerage in social networks) is not associated with underproduction.

Paper Structure

This paper contains 24 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: A conceptual diagram locating underproduction in open source software in relation to quality and importance, reproduced from Champion and Hill (2021).
  • Figure 2: A piece of the free/libre open source software supply chain. Software is typically developed "upstream", and then numerous software programs are packaged and integrated by Debian developers before being distributed as part of an operating system or using package management tools. Users may also directly install software from source files or precompiled binaries without the benefit of a package manager (not shown).
  • Figure 3: Visualizing package age based on when the package was added to Debian, with a generalized additive model (GAM) line to indicate the smoothed trend.
  • Figure 4: Violin plot of our data distribution broken down by the most commonly appearing languages. See Table I for models which test the relationship between language age and underproduction. This visualization contains data for 2,280 packages. On 135 occasions, the same package appears multiple times because it was consistently tagged as having been implemented in more than one language.
  • Figure 5: This visualization shows predicted underproduction probability from model M4 for two prototypical packages of different programming language ages, where package age varies as shown along the $x$-axis. The package shown in blue is written in a language that is 25 years old (as old as Java), while the package shown in red is written in a language that is 48 years old (as old as C). The gray ribbon shows a 95% confidence interval around the prediction.
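
The kind of prediction described in the Figure 5 caption can be sketched with a logistic model that includes an interaction between package age and language age. This is a minimal illustration only: the coefficient values, the dictionary `COEF`, and the helper names `predicted_probability` and `package_age_slope` are hypothetical and are not the paper's estimates or code. A negative interaction coefficient is assumed so that, as the abstract reports, the association with package age is weaker at higher language ages.

```python
import math

# Hypothetical coefficients chosen only for illustration.
# They are NOT the estimates reported for model M4 in the paper.
COEF = {
    "intercept": -3.0,
    "package_age": 0.20,
    "language_age": 0.03,
    "interaction": -0.003,  # assumed negative: the slope weakens as language age grows
}


def predicted_probability(package_age, language_age, coef=COEF):
    """Predicted probability from a logistic model with an interaction term."""
    log_odds = (
        coef["intercept"]
        + coef["package_age"] * package_age
        + coef["language_age"] * language_age
        + coef["interaction"] * package_age * language_age
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def package_age_slope(language_age, coef=COEF):
    """Change in log-odds per additional year of package age, at a given language age."""
    return coef["package_age"] + coef["interaction"] * language_age


# Two prototypical language ages, as in the Figure 5 caption (25 and 48 years).
for language_age in (25, 48):
    slope = package_age_slope(language_age)
    probs = [predicted_probability(age, language_age) for age in (0, 10, 20)]
    print(f"language age {language_age}: slope {slope:.3f}, "
          f"probabilities {[round(p, 3) for p in probs]}")
```

Under these assumed values the package-age slope stays positive at both language ages but is smaller for the older language, which is the qualitative pattern the caption and abstract describe.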