An Empirical Study of Java Code Improvements Based on Stack Overflow Answer Edits

In-on Wiratsin, Chaiyong Ragkhitwetsagul, Matheus Paixao, Denis De Sousa, Pongpop Lapvikai, Peter Haddawy

TL;DR

This paper presents an empirical study of Java answer edits on Stack Overflow (SO) and their applicability to open-source software (OSS) projects. Using SOTorrent, GitHub data, and a revision-aware code clone search tool (Siamese+), the authors identify and validate code updates from SO answer revisions that can improve Java code in OSS. They analyze 140,840 accepted SO Java answers and 10,673 GitHub Java projects, finding that 6.91% of the answers have more than one revision and that 49.30% of the latest SO code is applicable to OSS. They validate 391 useful updates (across 12 subtypes) as potentially beneficial, 11 of which were merged through 4 pull requests. The work demonstrates the practical utility of crowd-sourced answer edits for software maintenance and automation, and lays groundwork for revision-aware code-update recommendations in the GenAI era.

Abstract

Suboptimal code is prevalent in software systems. Developers often write low-quality code due to factors like technical knowledge gaps, insufficient experience, time pressure, management decisions, or personal factors. Once integrated, the accumulation of this suboptimal code leads to significant maintenance costs and technical debt. Developers frequently consult external knowledge bases, such as API documentation and Q&A websites like Stack Overflow (SO), to aid their programming tasks. SO's crowdsourced, collaborative nature has created a vast repository of programming knowledge. Its community-curated content is constantly evolving, with new answers posted or existing ones edited. In this paper, we present an empirical study of SO Java answer edits and their application to improving code in open-source projects. We use a modified code clone search tool to analyze SO code snippets with version history and apply it to open-source Java projects. This identifies outdated or unoptimized code and suggests improved alternatives. Analyzing 140,840 Java accepted answers from SOTorrent and 10,668 GitHub Java projects, we manually categorized SO answer edits and created pull requests to open-source projects with the suggested code improvements. Our results show that 6.91% of SO Java accepted answers have more than one revision (average of 2.82). Moreover, 49.24% of the code snippets in the answer edits are applicable to open-source projects, and 11 out of 36 proposed bug fixes based on these edits were accepted by the GitHub project maintainers.

Paper Structure

This paper contains 34 sections, 2 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: An example of a code edit on an SO accepted answer, which optimises the code by moving the port setup to the @Before method.
  • Figure 2: Overview of our exploratory study. We divide our study into three main steps. In step 1, we search for the optimised configuration of the Siamese code clone search tool for locating Stack Overflow clones based on the dataset of Stack Overflow–OSS Java projects (Ragkhitwetsagul et al., 2021). In step 2, we perform a code clone search between Stack Overflow code answer revisions (from the SOTorrent dataset) and GitHub projects. The clone search results and the manual investigation of the results are used to answer RQ1 and RQ2. In step 3, the classified bug-fixing answers are used to create pull requests to their associated GitHub projects (answer to RQ3).
  • Figure 3: GitHub project selection criteria based on the distributions of the numbers of stars, watchers, and forks.
  • Figure 4: Example of a pull request submitted as part of the process of answering RQ3.
  • Figure 5: Distribution of the metrics of the collected GitHub projects.
  • ...and 9 more figures