How Do Java Developers Reuse StackOverflow Answers in Their GitHub Projects?

Juntong Chen, Yan Zhao, Na Meng

TL;DR

This paper examines how StackOverflow answers are reused in Java projects on GitHub. It presents a hybrid detection pipeline combining PMD CPD-based clone detection, explicit answer-ID keyword searches, and manual inspection to precisely identify reused SO answers, applied to 130 threads across 357 GitHub projects. The authors find that reused answers tend to have higher scores, older ages, and longer code and explanations than unused answers. Only 9% of cases are exact code copies; most reuse involves modifying the code or re-deriving the solution, and the authors classify reuse into a taxonomy of patterns. The work contributes a dataset, a replication pipeline, and actionable insights for SO answerers and tool builders, with data and code released as open source to facilitate further cross-platform analyses.

Abstract

StackOverflow (SO) is a widely used question-and-answer (Q&A) website for software developers and computer scientists. GitHub is an online development platform used for storing, tracking, and collaborating on software projects. Prior work relates the information mined from both platforms to link user accounts or compare developers' activities across platforms. However, little work has been done to characterize the SO answers reused by GitHub projects. In this paper, we conducted an empirical study by mining the SO answers reused by Java projects available on GitHub. We created a hybrid approach of clone detection, keyword-based search, and manual inspection to identify the answer(s) actually leveraged by developers. Based on the identified answers, we further studied the topics of the discussion threads, answer characteristics (e.g., scores, ages, code lengths, and text lengths), and developers' reuse practices. We observed that most reused answers offer programs to implement specific coding tasks. Among all analyzed SO discussion threads, the reused answers often have relatively higher scores, older ages, longer code, and longer text than unused answers. In only 9% of scenarios (40/430), developers fully copied answer code for reuse. In the remaining scenarios, they reused partial code or created brand new code from scratch. Our study characterized 130 SO discussion threads referred to by Java developers in 357 GitHub projects. Our empirical findings can guide SO answerers to provide better answers, and shed light on future research related to SO and GitHub.
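
The keyword-based search step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tooling: the function name `find_so_references`, the URL shapes it recognizes, and the IDs in the usage example are assumptions made for illustration only. It scans Java source text for StackOverflow links and extracts the question and answer IDs they mention, which could then be checked by clone detection or manual inspection.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed URL shapes (illustrative, not taken from the paper):
#   stackoverflow.com/questions/<qid>/<slug>[#<answer id>]
#   stackoverflow.com/q/<qid>
#   stackoverflow.com/a/<answer id>
SO_LINK = re.compile(
    r"stackoverflow\.com/"
    r"(?:(?:questions|q)/(?P<qid>\d+)(?:/[^\s#]*)?(?:#(?P<frag>\d+))?"
    r"|a/(?P<aid>\d+))"
)


class SoReference(NamedTuple):
    line_no: int                 # 1-based line in the scanned source
    question_id: Optional[int]   # None when only an answer link is given
    answer_id: Optional[int]     # None when the link names no answer


def find_so_references(java_source: str) -> List[SoReference]:
    """Return every StackOverflow link found in the given source text."""
    refs = []
    for line_no, line in enumerate(java_source.splitlines(), start=1):
        for match in SO_LINK.finditer(line):
            qid = match.group("qid")
            # An answer ID can come from an /a/<id> link or a #<id> fragment.
            aid = match.group("aid") or match.group("frag")
            refs.append(
                SoReference(
                    line_no,
                    int(qid) if qid else None,
                    int(aid) if aid else None,
                )
            )
    return refs


# Usage with made-up IDs:
sample = """\
// Adapted from https://stackoverflow.com/a/111
class Demo {
    // See https://stackoverflow.com/questions/12345/some-title#67890
}
"""
references = find_so_references(sample)
```

A link that names only a question leaves `answer_id` empty, which is the situation where further steps are needed to decide which answer in the thread was actually reused.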

Paper Structure (20 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: An exemplar SO discussion thread that contains one question post and multiple answer posts
  • Figure 2: An exemplar Java file on GitHub that cites an SO thread and reuses some of the answer code
  • Figure 3: Our taxonomy of SO threads based on the discussion topics
  • Figure 4: The PR-comparison among reused answers, unused answers, and all answers of the 130 threads