LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering

Jiessie Tie, Bingsheng Yao, Tianshi Li, Syed Ishtiaque Ahmed, Dakuo Wang, Shurui Zhou

TL;DR

This study identifies the cases where ChatGPT failed, their root causes, and the mitigation strategies users applied, contributing to the understanding of human-AI interaction on SE tasks.

Abstract

Software engineers are integrating AI assistants into their workflows to enhance productivity and reduce cognitive strain. However, experiences vary significantly: some engineers find large language models (LLMs) such as ChatGPT beneficial, while others consider them counterproductive. Researchers have also found that ChatGPT's answers included incorrect information. Given that LLMs are still imperfect, it is important to understand how best to incorporate them into the workflow for software engineering (SE) task completion. Therefore, we conducted an observational study with 22 participants using ChatGPT as a coding assistant in a non-trivial SE task to understand the practices, challenges, and opportunities for using LLMs for SE tasks. We identified the cases where ChatGPT failed, their root causes, and the corresponding mitigation solutions used by users. These findings contribute to the overall understanding of, and strategies for, human-AI interaction on SE tasks. Our study also highlights directions for future research and tooling support.

Paper Structure

This paper contains 25 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Task Breakdown: A) Insert profile picture; B) Link email; C) Create division with headings/widgets: C1. Add visualization; D) Create division with headings/widgets: D1. Insert table; E) Insert footer division; F) Format side-by-side division: F1. Insert headings/subtitles; G) Insert division with form/buttons; H) Implement pop-up on button click; I) Implement form with local file saves and alert on submission.
  • Figure 2: Workflow of user interaction with ChatGPT: focus on failure appearance and mitigation strategies.
  • Figure 3: Distribution of response lengths for successful vs. unsuccessful cases.
  • Figure 4: Relationship between causes, failures, and mitigations.
  • Figure 5: Participants' ratings of (a) ChatGPT's helpfulness and (b) their comprehension of ChatGPT's responses.
  • ...and 1 more figure