Can GPT-4 Replicate Empirical Software Engineering Research?

Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas Zimmermann

TL;DR

The paper investigates whether off-the-shelf GPT-4 can replicate seven quantitative empirical software engineering studies on a new dataset. It prompts GPT-4 to surface methodological assumptions, design an analysis pipeline as modular specifications, and generate code, evaluating outputs with a user study of software engineering researchers and a manual code review. Results show GPT-4 often identifies correct assumptions and provides a usable high-level structure for analysis plans, but it suffers from domain knowledge gaps and frequent low-level coding errors that hinder autonomous replication. The findings suggest GPT-4 can scaffold replication workflows and democratize data science in software teams, provided that domain knowledge is taught or supplemented with human oversight and provenance to build trust. Future work should focus on enhancing domain-specific guidance, building replication-focused training data, and developing interfaces that clearly connect outputs to the underlying methodology.

Abstract

Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research. In this paper, we examine GPT-4's abilities to perform replications of empirical software engineering research on new data. We study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams.

Paper Structure (81 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: An overview of the prompts used in the study. To answer RQ1, we used Prompt 1 to generate assumptions for each paper and evaluated them in a user study. To answer RQ2, we used Prompt 2 to create an analysis plan and evaluated it in a user study. We also used Prompt 3 to create the code modules of the analysis plan and evaluated them in a code review workshop. The prompts in the figure are only summaries of the actual ones; for the complete prompts, see the supplemental materials.
  • Figure 2: Example GPT-4-generated assumption, given the methodology from fregnan2022first. Assumptions contain a name (1) and a description (2).
  • Figure 3: Example GPT-4-generated module, given the methodology from guzman2014sentiment. Modules contain a name (1), a description (2), inputs (3), a description of outputs (4), and the corresponding methodology text (5).
  • Figure 4: Example GPT-4-generated code, given the module specification from the analysis plan for pletea2014security. The generated code reads data from a JSON file or database (1), runs additional logic, then outputs the result as a JSON object (2).
  • Figure 5: The distribution of participants' scoring of the GPT-4-generated assumptions by correctness (left), relevance (middle), and insightfulness (right) for all papers.
  • ...and 1 more figure
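
The captions for Figures 2 and 3 describe the shape of the two artifacts that participants evaluated: an assumption has a name and a description, and a module has a name, a description, inputs, a description of outputs, and the corresponding methodology text. As a rough illustration only, those shapes could be modeled as below. This is a sketch, not the authors' schema: the class names, field names, and the plan_to_json helper are all hypothetical, and the choice of JSON for serialization simply mirrors the JSON output mentioned in the Figure 4 caption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Assumption:
    """Hypothetical model of a generated assumption (per the Figure 2 caption)."""
    name: str
    description: str


@dataclass
class ModuleSpec:
    """Hypothetical model of one analysis-plan module (per the Figure 3 caption)."""
    name: str
    description: str
    inputs: list[str]        # names of the data the module consumes
    outputs: str             # free-text description of what the module produces
    methodology_text: str    # methodology passage the module corresponds to


def plan_to_json(plan: list[ModuleSpec]) -> str:
    """Serialize an analysis plan (a list of module specifications) to JSON."""
    return json.dumps([asdict(module) for module in plan], indent=2)
```

An analysis plan is then just a list of ModuleSpec objects, so plan_to_json([ModuleSpec(...), ...]) yields one JSON object per module. The field values themselves would come from the model's output and are not reproduced here.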