Predicting long time contributors with knowledge units of programming languages: an empirical study

Md Ahasanuzzaman, Gustavo A. Oliva, Ahmed E. Hassan

TL;DR

This study demonstrates that programming-language knowledge units (KUs) can effectively predict which developers will become long-time contributors (LTCs) in FLOSS projects. By constructing KULTC, a KU-based predictor, from five dimensions of KU features and using a random forest classifier, the authors achieve median AUCs of at least $0.75$ across LTC settings and consistently outperform the state-of-the-art BAOLTC baseline. Combining KU features with BAOLTC (KULTC+BAOLTC) further improves predictive accuracy, with median AUC up to $0.81$ and relative gains of around $14.5\%$ to $16.5\%$; a cost-effective variant (KULTC_DEV_EXP+BAOLTC) also surpasses BAOLTC while reducing feature-engineering costs. SHAP analysis reveals that early developer KU-based expertise in the studied projects is the most influential predictor, highlighting the practical value of language-focused expertise for mentorship and resource allocation in FLOSS projects.

Abstract

Predicting potential long-time contributors (LTCs) early allows project maintainers to effectively allocate resources and mentoring to enhance their development and retention. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predicting LTCs do not consider programming language skills. This paper reports an empirical study on the usage of knowledge units (KUs) of the Java programming language to predict LTCs. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We build a prediction model called KULTC, which leverages KU-based features along five different dimensions. We detect and analyze KUs from the 75 studied Java projects (353K commits and 168K pull requests) as well as 4,219 other Java projects in which the studied developers previously worked (1.7M commits). We compare the performance of KULTC with the state-of-the-art model, which we call BAOLTC. Even though KULTC focuses exclusively on the programming language perspective, KULTC achieves a median AUC of at least 0.75 and significantly outperforms BAOLTC. Combining the features of KULTC with the features of BAOLTC results in an enhanced model (KULTC+BAOLTC) that significantly outperforms BAOLTC with a normalized AUC improvement of 16.5%. Our feature importance analysis with SHAP reveals that developer expertise in the studied project is the most influential feature dimension for predicting LTCs. Finally, we develop a cost-effective model (KULTC_DEV_EXP+BAOLTC) that significantly outperforms BAOLTC. These encouraging results can be helpful to researchers who wish to further study developer engagement and retention in FLOSS projects or build models for predicting LTCs.
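
The abstract reports results as AUC values and as a "normalized AUC improvement" but does not define the latter here. The following is a minimal sketch, not the authors' implementation, of how the two quantities could be computed. It assumes that the normalized improvement is the share of the baseline's remaining headroom ($1 - \text{AUC}$) that the new model closes; that definition, the function names, and the toy labels and scores are all illustrative assumptions and do not come from the study.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive (LTC) is
    scored above a randomly chosen negative (non-LTC); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def normalized_improvement(auc_base: float, auc_new: float) -> float:
    """ASSUMED definition (not taken from the source): fraction of the
    baseline's headroom (1 - auc_base) that the new model closes."""
    if not 0.0 <= auc_base < 1.0:
        raise ValueError("auc_base must be in [0, 1)")
    return (auc_new - auc_base) / (1.0 - auc_base)


# Toy data only: 1 = became an LTC, 0 = did not. Scores are made up.
labels = [1, 0, 1, 0, 0, 1]
baseline_scores = [0.9, 0.6, 0.40, 0.3, 0.5, 0.7]  # hypothetical baseline model
combined_scores = [0.9, 0.6, 0.55, 0.3, 0.5, 0.7]  # hypothetical combined model

auc_base = auc(labels, baseline_scores)
auc_new = auc(labels, combined_scores)
print(f"baseline AUC = {auc_base:.3f}, combined AUC = {auc_new:.3f}")
print(f"normalized improvement = {normalized_improvement(auc_base, auc_new):.1%}")
```

Under this assumed definition, an improvement is measured relative to how far the baseline is from a perfect AUC of 1, so the same absolute gain counts for more when the baseline is already strong.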

Paper Structure (38 sections, 1 equation, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Our metamodel for knowledge units (KUs).
  • Figure 4: The distribution of AUC of KULTC and BAOLTC across LTC settings.
  • Figure 5: An example illustrating SHAP values for a prediction model.
  • Figure 6: The distribution of the sum of absolute SHAP values for each KU-based feature dimension and their importance ranking in predicting LTCs (lower rank = better).
  • Figure 7: The distribution of AUC of KULTC, BAOLTC and the combined model KULTC+BAOLTC. Models are grouped according to their Scott-Knott ESD rank (lower rank = better).
  • ...and 4 more figures