Predicting Open Source Software Sustainability with Deep Temporal Neural Hierarchical Architectures and Explainable AI

S M Rakib Ul Karim, Wenyi Lu, Enock Kasaadha, Sean Goggins

TL;DR

This work introduces a hierarchical temporal framework to predict OSS sustainability lifecycle stages by jointly modeling 24-month activity sequences and engineered tabular features, routing predictions through a Stage-1 gate, a Heavy Transformer+MLP plus a Light MLP, and a Club-Fed Expert for minority classes. The approach achieves high overall accuracy (94.08%) and balanced performance, with attribution analyses showing sustained-contribution and community dynamics as the primary signals and a pronounced recency effect in temporal predictions. Explainability is embedded via SHAP and Integrated Gradients, enabling category-level interpretations and ablation validation, thereby supporting actionable insights for maintainers and funders. The results highlight the central role of continuous contribution and prompt maintenance, offer scalable ecosystem-monitoring capabilities, and outline future directions for broader feature modalities and cross-domain generalization.

Abstract

Open Source Software (OSS) projects follow diverse lifecycle trajectories shaped by evolving patterns of contribution, coordination, and community engagement. Understanding these trajectories is essential for stakeholders seeking to assess project organization and health at scale. However, prior work has largely relied on static or aggregated metrics, such as project age or cumulative activity, providing limited insight into how OSS sustainability unfolds over time. In this paper, we propose a hierarchical predictive framework that models OSS projects as belonging to distinct lifecycle stages grounded in established socio-technical categorizations of OSS development. Rather than treating sustainability solely as project longevity, these lifecycle stages operationalize sustainability as a multidimensional construct integrating contribution activity, community participation, and maintenance dynamics. The framework combines engineered tabular indicators with 24-month temporal activity sequences and employs a multi-stage classification pipeline to distinguish lifecycle stages associated with different coordination and participation regimes. To support transparency, we incorporate explainable AI techniques to examine the relative contribution of feature categories to model predictions. Evaluated on a large corpus of OSS repositories, the proposed approach achieves over 94% overall accuracy in lifecycle stage classification. Attribution analyses consistently identify contribution activity and community-related features as dominant signals, highlighting the central role of collective participation dynamics.

Paper Structure (71 sections, 28 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of the hierarchical OSS lifecycle prediction pipeline. Temporal and tabular features are processed through staged classifiers with confidence-based routing to produce final lifecycle stage predictions.
  • Figure 2: Confusion matrix for Hierarchical Pipeline predictions across four sustainability stages.
  • Figure 3: Normalized category importance heatmap showing relative influence of feature categories (rows) across different model architectures (columns), with values normalized to [0,1] scale.
  • Figure 4: Bar chart of average category contribution, showing mean total importance scores aggregated across all models, sorted by descending influence.
  • Figure 6: Detailed architectures of the four independently trained models. Panel A illustrates task-specific data preparation and splitting. Panel B shows the internal network architectures, including layer types and dimensions. Panel C summarizes the trained model artifacts and their respective outputs.
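
The confidence-based routing mentioned in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the threshold value, the direction of the routing (confident cases to the Light model, uncertain cases to the Heavy model), and every name below (`route`, `light_model`, `heavy_model`, `threshold`) are assumptions made purely for illustration, and the Club-Fed Expert is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Probs = Sequence[float]


def argmax(probs: Probs) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def route(
    gate_probs: Probs,
    light_model: Callable[[Sequence[float]], Probs],
    heavy_model: Callable[[Sequence[Sequence[float]], Sequence[float]], Probs],
    tabular: Sequence[float],
    sequence: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> Tuple[int, str]:
    """Pick a downstream classifier based on the gate's confidence.

    Assumption: when the gate is confident, the cheaper tabular-only model
    is used; otherwise the temporal + tabular model is consulted.
    Returns (predicted stage index, name of the path taken).
    """
    if max(gate_probs) >= threshold:
        return argmax(light_model(tabular)), "light"
    return argmax(heavy_model(sequence, tabular)), "heavy"


if __name__ == "__main__":
    # Stub models standing in for trained networks (four lifecycle stages).
    light = lambda tab: [0.7, 0.1, 0.1, 0.1]
    heavy = lambda seq, tab: [0.1, 0.1, 0.2, 0.6]
    months: List[List[float]] = [[0.0, 0.0, 0.0] for _ in range(24)]
    print(route([0.9, 0.1], light, heavy, [1.0], months))
    print(route([0.55, 0.45], light, heavy, [1.0], months))
```

The design choice sketched here is that routing happens per project, so each repository's final lifecycle stage comes from exactly one downstream model; the paper's actual gate outputs and routing criteria may differ.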