Reproducibility of Issues Reported in Stack Overflow Questions: Challenges, Impact & Estimation

Saikat Mondal, Banani Roy

TL;DR

This work investigates the reproducibility of issues described in Stack Overflow questions, validating a catalog of reproducibility challenges with practitioners and predicting reproducibility from code-based features. Through a two-phase study (an online survey of 53 practitioners, followed by ML modeling on 357 Java questions with nine code-derived features), the authors demonstrate strong practitioner agreement with the catalog, identify the challenges that most severely block reproducibility, and achieve strong predictive performance (e.g., up to 84.5% precision and 82.8% F1) across five classifiers. SHAP analyses highlight LOC, presence of main/method/class, and parsability as key predictors, while chi-square tests corroborate the statistical relevance of several features. Generalization to 100 C# questions further suggests cross-language applicability and supports broader use, including automated tooling that guides question authors and speeds up the resolution of reported issues. The study also proposes interactive tools (browser/IDE plugins, static analyzers) to detect and fill in missing parts, with implications for improving answer quality and reducing time to help on Q&A platforms.

Abstract

Software developers often submit questions to technical Q&A sites like Stack Overflow (SO) to resolve code-level problems. In practice, they include example code snippets with questions to explain the programming issues. Existing research suggests that users attempt to reproduce the reported issues using the given code snippets when answering questions. Unfortunately, such code snippets cannot always reproduce the issues due to several unmet challenges, which prevents questions from receiving appropriate and prompt solutions. One previous study investigated reproducibility challenges and produced a catalog. However, how practitioners perceive this challenge catalog is unknown. Practitioners' perspectives are essential for validating these challenges and estimating their severity. This study first surveyed 53 practitioners to understand their perspectives on reproducibility challenges. We attempt to (a) see whether they agree with these challenges, (b) determine the impact of each challenge on answering questions, and (c) identify the need for tools to promote reproducibility. Survey results show that (a) about 90% of the participants agree with the challenges, (b) "missing an important part of code" most severely hurts reproducibility, and (c) participants strongly recommend introducing automated tool support to promote reproducibility. Second, we extract nine code-based features (e.g., LOC, compilability) and build five Machine Learning (ML) models to predict issue reproducibility. Early detection might help users improve code snippets and their reproducibility. Our models achieve 84.5% precision, 83.0% recall, 82.8% F1-score, and 82.8% overall accuracy, which are highly promising. Third, we systematically interpret the ML model and explain how code snippets with reproducible issues differ from those with irreproducible issues.
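
To make the idea of code-based features more concrete, the sketch below computes a few of the features named in the summary (LOC and the presence of a class, a method, and a main method) for a Java snippet. It is only an illustration under stated assumptions: the function names, the regular expressions, and the brace-balance check used as a rough stand-in for parsability are all hypothetical and are not the authors' extractors, which this page does not show.

```python
import re

# Hypothetical regex heuristics for illustration only; they are NOT the
# feature extractors used in the paper.
CLASS_RE = re.compile(r"\b(?:class|interface|enum)\s+[A-Za-z_]\w*")
MAIN_RE = re.compile(r"\bstatic\s+void\s+main\s*\(")
# A name followed by a parameter list and an opening brace, e.g. "foo(int a) {".
# The lookbehind skips the common "new Foo() {" anonymous-class form.
METHOD_RE = re.compile(
    r"(?<!new\s)\b([A-Za-z_]\w*)\s*\([^(){};]*\)\s*(?:throws\s+[\w.,\s]+?)?\s*\{"
)
# Control-flow keywords that look like method headers to the regex above.
CONTROL_KEYWORDS = {"if", "for", "while", "switch", "catch", "synchronized", "try"}


def count_loc(snippet):
    """Count non-blank lines."""
    return sum(1 for line in snippet.splitlines() if line.strip())


def braces_balanced(snippet):
    """Crude stand-in for parsability: every '{' has a matching '}'.

    Braces inside string literals or comments are not treated specially.
    """
    depth = 0
    for ch in snippet:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def extract_features(snippet):
    """Return a small dictionary of code-based features for one snippet."""
    has_method = any(
        m.group(1) not in CONTROL_KEYWORDS for m in METHOD_RE.finditer(snippet)
    )
    return {
        "loc": count_loc(snippet),
        "has_class": bool(CLASS_RE.search(snippet)),
        "has_main": bool(MAIN_RE.search(snippet)),
        "has_method": has_method,
        "braces_balanced": braces_balanced(snippet),
    }


complete_snippet = """public class Demo {
    public static void main(String[] args) {
        int x = 1;
        if (x > 0) {
            System.out.println(x);
        }
    }
}
"""

fragment_snippet = """if (x > 0) {
    list.add(x);
"""

print(extract_features(complete_snippet))
print(extract_features(fragment_snippet))
```

A feature dictionary like this could be turned into a numeric vector and passed to any standard classifier. A real pipeline would more plausibly rely on an actual Java parser and compiler to decide parsability and compilability rather than on regular expressions.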

Paper Structure (28 sections, 16 figures, 11 tables)

Figures (16)

  • Figure 1: An example Stack Overflow question that discusses a programming issue.
  • Figure 2: An example Stack Overflow question whose issue could not be reproduced, mainly due to two unmet challenges: (i) class/interface/method not found and (ii) important part of code missing.
  • Figure 3: An example Stack Overflow question whose issue could not be reproduced, mainly due to three unmet challenges: (i) external library not found, (ii) identifier/object type not found, and (iii) too short code snippet.
  • Figure 4: An example Stack Overflow question whose issue could not be reproduced, mainly due to two unmet challenges: (i) database/file/UI dependency and (ii) class/interface/method not found.
  • Figure 5: An example Stack Overflow question whose issue could not be reproduced, mainly due to the outdated code challenge.
  • ...and 11 more figures