Rocks Coding, Not Development--A Human-Centric, Experimental Evaluation of LLM-Supported SE Tasks

Wei Wang, Huilong Ning, Gaowei Zhang, Libo Liu, Yi Wang

TL;DR

The paper empirically evaluates ChatGPT's usefulness in software engineering tasks using a large-scale, controlled 2×2 experiment (109 participants) across coding puzzles and a real bug-fix task. It finds that ChatGPT reliably speeds up simple coding puzzles but provides limited benefits for typical development work, with no robust improvements in solution quality and only modest changes to perceived workload. The study also documents rich interaction patterns between developers and the AI, highlighting both collaborative workflows and traps where reliance on AI can hinder performance. These results suggest a future in which human developers collaborate with LLMs through guided interaction, prompt engineering, and education, rather than expecting AI to replace human software engineers any time soon.

Abstract

Recently, generative AI based on large language models (LLMs) has been gaining momentum for its impressive, high-quality performance in multiple domains, particularly after the release of ChatGPT. Many believe that these models have the potential to perform general-purpose problem-solving in software development and replace human software developers. Nevertheless, there is a lack of serious investigation into the capability of these LLM techniques in fulfilling software development tasks. In a controlled 2 x 2 between-subject experiment with 109 participants, we examined whether and to what degree working with ChatGPT was helpful in a coding task and a typical software development task, and how people work with ChatGPT. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks was not that good. We also observed the interactions between participants and ChatGPT and found relations between the interactions and the outcomes. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers and motivates the need for novel interaction mechanisms that help developers effectively work with large language models to achieve desired outcomes.

Paper Structure

This paper contains 38 sections, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: The overall view of the experiment conditions and groups.
  • Figure 2: Two participants were in their experiment sessions, attempting to finish the tasks assigned to them.
  • Figure 3: The distributions of participants' efficiency in different experiment groups.
  • Figure 4: The distributions of participants' solution quality in different experiment groups.
  • Figure 5: The distributions of participants' subjective task load in different experiment groups.
  • ...and 3 more figures