Rethinking Code Review Workflows with LLM Assistance: An Empirical Study

Fannar Steinn Aðalsteinsson, Björn Borgar Magnússon, Mislav Milicevic, Adam Nirving Davidsson, Chih-Hong Cheng

TL;DR

The paper investigates how Large Language Models (LLMs) can be meaningfully integrated into code review (CR) workflows through a two-phase, real-world study at WirelessCar. It combines an exploratory field study with a field experiment that compares an AI-led co-reviewer mode and an on-demand interactive assistant, both powered by a retrieval-augmented generation (RAG) pipeline to provide context from diffs, code, and Jira tickets. Findings show AI-led reviews are generally preferred for large or unfamiliar pull requests, but adoption is influenced by trust, false positives, and latency, underscoring that AI should augment rather than replace human reviewers and should be tightly integrated into existing tools. The study yields practical design directions for context-aware, fast, and concise AI feedback embedded in developers’ environments (GitHub, IDEs, Slack) and highlights pre-review use as a promising enhancement to code quality upstream.

Abstract

Code reviews are a critical yet time-consuming aspect of modern software development, increasingly challenged by growing system complexity and the demand for faster delivery. This paper presents a study conducted at WirelessCar Sweden AB, combining an exploratory field study of current code review practices with a field experiment involving two variations of an LLM-assisted code review tool. The field study identifies key challenges in traditional code reviews, including frequent context switching and insufficient contextual information, and highlights both opportunities (e.g., automatic summarization of complex pull requests) and concerns (e.g., false positives and trust issues) in using LLMs. In the field experiment, we developed two prototype variations: one offering LLM-generated reviews upfront and the other enabling on-demand interaction. Both use a semantic search pipeline based on retrieval-augmented generation to assemble relevant contextual information for the review, thereby addressing the challenges uncovered in the field study. Developers evaluated both variations in real-world settings: AI-led reviews were preferred overall, although this preference depended on the reviewers' familiarity with the code base and on the severity of the pull request.

Paper Structure

This paper contains 22 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Flowchart detailing the research workflow
  • Figure 2: Screenshot of the LLM-assisted code review interface in Mode B, reviewing a pull request from an open-source project available at https://github.com/ogen-go/ogen/pull/1440.
  • Figure 3: Agentic tool structure in Co-Reviewer mode.