Human-AI Synergy in Agentic Code Review

Suzhen Zhong, Shayan Noei, Ying Zou, Bram Adams

Abstract

Code review is a critical software engineering practice in which developers review code changes before integration to ensure code quality, detect defects, and improve maintainability. In recent years, AI agents that can understand code context, plan review actions, and interact with development environments have been increasingly integrated into the code review process. However, there is limited empirical evidence comparing the effectiveness of AI agents and human reviewers in collaborative workflows. To address this gap, we conduct a large-scale empirical analysis of 278,790 code review conversations across 300 open-source GitHub projects. We compare the feedback provided by human reviewers and AI agents, and we investigate human-AI collaboration patterns in review conversations to understand how interaction shapes review outcomes. Moreover, we analyze the adoption of code suggestions provided by human reviewers and AI agents into the codebase and how adopted suggestions change code quality. We find that human reviewers provide additional types of feedback beyond those given by AI agents, including understanding, testing, and knowledge transfer. Human reviewers exchange 11.8% more rounds when reviewing AI-generated code than when reviewing human-written code. Moreover, code suggestions made by AI agents are adopted into the codebase at a significantly lower rate than suggestions proposed by human reviewers. Over half of the unadopted suggestions from AI agents are either incorrect or addressed through alternative fixes by developers. When adopted, suggestions provided by AI agents produce significantly larger increases in code complexity and code size than suggestions provided by human reviewers. Our findings suggest that while AI agents can scale defect screening, human oversight remains critical for ensuring suggestion quality and providing contextual feedback that AI agents lack.

Paper Structure (15 sections, 4 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Example of an AI agent (GitHub Copilot) reviewing a hunk. A hunk is a block of changes in the code, displaying lines added (+) and removed (-). The AI agent provides feedback with natural language and code suggestions (proposed modifications enclosed in triple backticks with the suggestion tag) to fix a typo.
  • Figure 2: Overview of our approach.
  • Figure 3: Prompt for labelling feedback types.
  • Figure 4: Distribution of feedback categories across the four review categories: HRH (Human reviews Human-written code), HRA (Human reviews Agent-generated code), ARH (Agent reviews Human-written code), and ARA (Agent reviews Agent-generated code).
  • Figure 5: Results of the Scott-Knott ESD test on Comment-to-Code Density (CD) across review categories. Categories within the same group (G1, G2, or G3) exhibit no statistically significant difference in CD.
  • ...and 4 more figures
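
The Figure 1 caption describes code suggestions as proposed modifications enclosed in triple backticks with the suggestion tag. As a rough illustration of how such blocks could be pulled out of a review comment, here is a minimal Python sketch. The function name `extract_suggestions`, the regular expression, and the sample comment are assumptions made for illustration; this is not the authors' extraction pipeline.

```python
import re

# Three backticks, built indirectly so this example's own fence stays intact.
FENCE = "`" * 3

# Matches a fenced block whose info string is "suggestion" and captures its body.
# Hypothetical pattern for illustration; real comments may need more lenient parsing.
SUGGESTION_RE = re.compile(
    rf"^{FENCE}suggestion[ \t]*\r?\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_suggestions(comment_body: str) -> list[str]:
    """Return the proposed replacement text of every suggestion block in a comment."""
    return [m.group(1).rstrip("\r\n") for m in SUGGESTION_RE.finditer(comment_body)]


# Made-up review comment that fixes a typo, in the spirit of Figure 1.
comment = "\n".join([
    "Typo in the log message.",
    FENCE + "suggestion",
    'log.info("Received request")',
    FENCE,
])
suggestions = extract_suggestions(comment)
```

A comment with no suggestion block yields an empty list, which gives one simple way to separate natural-language-only feedback from feedback that carries a concrete code suggestion.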