A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions

Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMillan, Toby Jia-Jun Li

TL;DR

The paper investigates how developers validate and repair Copilot-generated code and how awareness of code provenance affects their behavior. Using a lab study with 28 participants, the authors combine eye-tracking and IDE tracking via CodeGRITS, NASA-TLX workload measures, and semi-structured interviews across three Java tasks containing Copilot bugs. The study finds that provenance awareness improves bug-fixing performance but increases cognitive workload, and that LLM-generated code differs from human-written code in terms of readability, error types, and strategic reading patterns (e.g., more switching between code and comments). These results inform design guidelines for provenance-aware tools and interfaces that support effective human-LLM collaboration in software development. Overall, the work advances understanding of how to label, interpret, and manage AI-generated code within real developer workflows and points to future work incorporating professional developers and multimodal information.

Abstract

The increasing use of large language model (LLM)-powered code generation tools, such as GitHub Copilot, is transforming software engineering practices. This paper investigates how developers validate and repair code generated by Copilot and examines the impact of code provenance awareness during these processes. We conducted a lab study with 28 participants, who were tasked with validating and repairing Copilot-generated code in three software projects. Participants were randomly divided into two groups: one informed about the provenance of LLM-generated code and the other not. We collected IDE interaction data, eye-tracking data, and cognitive workload assessments, and conducted semi-structured interviews. Our results indicate that, without explicit information, developers often fail to identify the LLM origin of the code. Developers generally employ similar validation and repair strategies for LLM-generated code, but exhibit behaviors such as frequent switching between code and comments, different attentional focus, and a tendency to delete and rewrite code. Being aware of the code's provenance led to improved performance, increased search efforts, more frequent Copilot usage, and higher cognitive workload. These findings enhance our understanding of how developers interact with LLM-generated code and carry implications for designing tools that facilitate effective human-LLM collaboration in software development.

Paper Structure (31 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: A glance of how Copilot is invoked in our study setting, along with the collected IDE tracking and eye tracking data.
  • Figure 2: An example of errors in LLM-generated code in task ZooSystem: 6112 (Wrong Component) and 6125 (Parameter Value).
  • Figure 3: Common patterns of transitions among distinct behaviors across all participants. Patterns #1 and #2 are the top two most frequent patterns across all sequences. Patterns #3, #4, and #5 are the most frequent patterns that involve Invoking Copilot, Running for Output, and Employing Debugger, respectively.
  • Figure 4: The success rates of bug validation and repair for the Informed group and the Non-Informed group.
  • Figure 5: Comparison of the frequency of different types of developer behaviors between the Informed group (left) and the Non-Informed group (right).
  • ...and 1 more figures