The Impact of Large Language Models (LLMs) on Code Review Process
Antonio Collante, Samuel Abedu, SayedHassan Khatoonabadi, Ahmad Abdellatif, Ebube Alor, Emad Shihab
TL;DR
This paper investigates the phase-specific impact of GPT-based assistance on GitHub pull request (PR) workflows. The authors build a dataset of $25{,}473$ PRs from $9{,}254$ repositories and label GPT-assisted instances through a semi-automated heuristic process. Using Manhattan-distance alignment and a two-stage event labeling scheme, the study applies a multiple linear regression model and the Mann-Whitney U test to compare GPT-assisted and non-assisted PRs across PR phases. GPT-assisted PRs have a median merge time of $9$ hours versus $23$ hours for non-assisted PRs (approximately a $61\%$ reduction), with the most pronounced phase gains in Review ($1$h vs $3$h) and Waiting for Changes ($3$h vs $24$h), while the Change phase shows minimal difference. GPT is predominantly used for enhancements, bug fixes, and documentation, which suggests it acts as a targeted augmentation tool that accelerates iterative code review tasks. The work offers practical guidance for integrating AI into software development and highlights areas where its methodology and validity could be strengthened.
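The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names `percent_reduction` and `mann_whitney_u` are hypothetical, the merge times are toy values chosen so that the medians match the reported 23 and 9 hours, and only the U statistic is computed (no p-value).

```python
from statistics import median


def percent_reduction(baseline, treated):
    """Relative reduction of `treated` versus `baseline`, in percent."""
    return 100.0 * (baseline - treated) / baseline


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs (a, b) with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Toy merge times in hours (illustrative only, NOT the study's data).
non_assisted = [30, 23, 48, 12, 20]
assisted = [9, 4, 15, 8, 11]

median_reduction = percent_reduction(median(non_assisted), median(assisted))
u_assisted = mann_whitney_u(assisted, non_assisted)
u_non_assisted = mann_whitney_u(non_assisted, assisted)

print(f"median reduction: {median_reduction:.1f}%")
print(f"U (assisted) = {u_assisted}, U (non-assisted) = {u_non_assisted}")
```

A small U for the assisted group means that assisted PRs rarely take longer than non-assisted ones in pairwise comparison. The two U values always sum to the product of the sample sizes. The same `percent_reduction` arithmetic applied to the Waiting for Changes medians (24 hours to 3 hours) gives 87.5%.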
Abstract
Large language models (LLMs) have recently gained prominence in software development, where they are credited with boosting productivity and simplifying teamwork. Although prior studies have examined task-specific applications, the phase-specific effects of LLM assistance on the efficiency of code review remain underexplored. This research investigates the effect of GPT on GitHub pull request (PR) workflows, focusing on resolution time, phase-specific performance, and the ways GPT assists developers. We curated a dataset of 25,473 PRs from 9,254 GitHub projects and identified GPT-assisted PRs using a semi-automated heuristic approach that combines keyword-based detection, regular expression filtering, and manual verification, repeated until labeling accuracy reached 95%. We then applied statistical modeling, including multiple linear regression and the Mann-Whitney U test, to evaluate differences between GPT-assisted and non-assisted PRs, both at the overall resolution level and across distinct review phases. Our results indicate that early adoption of GPT can substantially improve the efficiency of the PR process, with time savings at several stages. GPT-assisted PRs had a median resolution time more than 60% lower than non-assisted PRs (9 hours compared to 23 hours). Using GPT was associated with a 33% reduction in review time and an 87% reduction in waiting time before acceptance. In a sample of 300 GPT-assisted PRs, developers predominantly used GPT for code optimization (60%), bug fixing (26%), and documentation updates (12%). This research sheds light on the impact of GPT on the code review process and offers actionable insights for software teams seeking to improve their workflows and collaboration.
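The labeling pipeline described above (keyword screen, regular expression filter, then manual verification) might look roughly like the following sketch. The keyword list, the regular expression, and the function names are all assumptions made for illustration; the abstract does not give the actual patterns used in the study.

```python
import re

# Stage 1: cheap keyword screen (hypothetical keyword list).
KEYWORDS = ("chatgpt", "gpt-3", "gpt-4", "gpt3", "gpt4")

# Stage 2: regex requiring phrasing that suggests GPT actually helped,
# rather than GPT merely being mentioned (hypothetical pattern).
ASSIST_RE = re.compile(
    r"\b(?:generated|suggested|written|refactored|fixed)\b"
    r".{0,40}?\b(?:by|with|using)\s+(?:chat\s?gpt|gpt-?[34])",
    re.IGNORECASE,
)


def is_candidate(text):
    """Flag PR text that passes both the keyword screen and the regex filter."""
    lowered = text.lower()
    has_keyword = any(keyword in lowered for keyword in KEYWORDS)
    return has_keyword and ASSIST_RE.search(text) is not None


def labeling_accuracy(predicted, manual):
    """Share of heuristic labels that agree with manually verified labels."""
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


# Toy PR descriptions (illustrative only).
samples = [
    "Refactored the parser with ChatGPT suggestions",
    "Add GPT-4 model support to the client",
    "Fix typo in README",
]
flags = [is_candidate(text) for text in samples]
print(flags)
```

In a workflow like the one described, the flagged candidates would be checked by hand, and the patterns refined until `labeling_accuracy` on a verified sample reaches the target (95% in the study). The second sample shows why a keyword match alone is not enough: the PR mentions GPT-4 as a feature, not as an assistant.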
