Comprehension-Performance Gap in GenAI-Assisted Brownfield Programming: A Replication and Extension

Yunhan Qiao, Christopher Hundhausen, Summit Haque, Md Istiak Hossain Shihab

TL;DR

This study investigates how GenAI coding assistants, specifically GitHub Copilot, affect code comprehension during brownfield maintenance. By replicating and extending a prior within-subjects design, it measures both performance (task time and tests passed) and comprehension after two feature-implementation tasks, with and without Copilot. The key finding is a comprehension-performance gap: Copilot substantially boosts productivity but does not improve understanding of the legacy code, and comprehension does not correlate with performance in either condition. These results have implications for programming education and GenAI tool design, suggesting a shift toward fostering system-level thinking and dedicated comprehension modes in GenAI assistants.

Abstract

Code comprehension is essential for brownfield programming tasks, in which developers maintain and enhance legacy codebases. Generative AI (GenAI) coding assistants such as GitHub Copilot have been shown to improve developer productivity, but their impact on code understanding is less clear. We replicate and extend a previous study by exploring both performance and comprehension in GenAI-assisted brownfield programming tasks. In a within-subjects experimental study, 18 computer science graduate students completed feature implementation tasks with and without Copilot. Results show that Copilot significantly reduced task time and increased the number of test cases passed. However, comprehension scores did not differ across conditions, revealing a comprehension-performance gap: participants passed more test cases with Copilot, but did not demonstrate greater understanding of the legacy codebase. Moreover, we failed to find a correlation between comprehension and task performance. These findings suggest that while GenAI tools can accelerate programming progress in a legacy codebase, such progress may come without an improved understanding of that codebase. We consider the implications of these findings for programming education and GenAI tool design.

Paper Structure

This paper contains 39 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Task 1 completion time by condition
  • Figure 2: Total tests passed by condition
  • Figure 3: Mean percentage of time spent in each activity category, by condition
  • Figure 4: Mean percentage of time spent in each code writing activity, by condition, showing a shift from almost exclusively manual code entry without Copilot to a mix of coding methods with Copilot assistance
  • Figure 5: Programming workflow networks comparing activity transitions without Copilot (left) and with Copilot (right). Node size indicates time spent on each activity, while arrow thickness shows transition frequency. Frequently connected activities appear closer together, revealing the emergence of a GenAI-mediated prompt→response→implement cycle with Copilot.
  • ...and 3 more figures