What Types of Code Review Comments Do Developers Most Frequently Resolve?

Saul Goldman, Hong Yi Lin, Jirat Pasuksmit, Patanamon Thongtanunam, Kla Tantithamthavorn, Zhe Wang, Ray Zhang, Ali Behnaz, Fan Jiang, Michael Siers, Ryan Jiang, Mike Buller, Minwoo Jeong, Ming Wu

TL;DR

The paper asks which types of code review comments are actionable. It compares human-written and LLM-generated feedback by whether that feedback leads to subsequent code changes. It develops a six-category taxonomy (readability, bugs, maintainability, design, no issue, other) and uses an LLM-as-a-Judge to classify comments from Atlassian's internal projects and from OSS projects, then links comment types to resolution rates. LLM and human reviewers emphasize different issues depending on project context. LLM-generated comments about readability, bugs, and maintainability are resolved more often than those about design. These results point to complementary roles for LLM and human reviewers and suggest ways to improve LLM-driven code review tools. For practical tool design, the work stresses keeping LLM-generated feedback balanced across issue types and clearly worded, so that more of it is actionable in real-world workflows.
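The classification step can be pictured as a function that wraps each comment in a prompt and maps the judge's answer back onto the taxonomy. The sketch below is hypothetical. The prompt wording, the `judge` callable (any function from prompt text to answer text), and the fallback to "other" for unrecognized answers are illustrative assumptions. It is not the paper's prompt template (Figure 2) or its implementation.

```python
from typing import Callable

# Labels from the six-category taxonomy listed in the TL;DR.
CATEGORIES = ("readability", "bugs", "maintainability", "design", "no issue", "other")


def build_prompt(comment: str, code_context: str) -> str:
    """Hypothetical judge prompt; NOT the paper's template (Figure 2)."""
    labels = ", ".join(CATEGORIES)
    return (
        "You are judging a code review comment.\n"
        f"Code under review:\n{code_context}\n\n"
        f"Review comment:\n{comment}\n\n"
        f"Answer with exactly one label from: {labels}."
    )


def parse_label(raw: str) -> str:
    """Normalize the judge's answer; anything outside the taxonomy becomes 'other' (an assumption)."""
    answer = raw.strip().lower().rstrip(".")
    return answer if answer in CATEGORIES else "other"


def classify(comment: str, code_context: str, judge: Callable[[str], str]) -> str:
    """Ask a judge model (hypothetical prompt -> text callable) for one category."""
    return parse_label(judge(build_prompt(comment, code_context)))
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client. For example, `classify("x may be None here", "def f(x): return x.y", lambda p: "Bugs.")` returns `"bugs"`.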

Abstract

Large language model (LLM)-powered code review automation tools have been introduced to generate code review comments. However, not all generated comments will drive code changes. Understanding what types of generated review comments are likely to trigger code changes is crucial for identifying those that are actionable. In this paper, we set out to investigate (1) the types of review comments written by humans and LLMs, and (2) the types of generated comments that are most frequently resolved by developers. To do so, we developed an LLM-as-a-Judge to automatically classify review comments based on our own taxonomy of five categories. Our empirical study confirms that (1) the LLM reviewer and human reviewers exhibit distinct strengths and weaknesses depending on the project context, and (2) readability, bugs, and maintainability-related comments had higher resolution rates than those focused on code design. These results suggest that a substantial proportion of LLM-generated comments are actionable and can be resolved by developers. Our work highlights the complementarity between LLM and human reviewers and offers suggestions to improve the practical effectiveness of LLM-powered code review tools.

Paper Structure

This paper contains 10 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: An overview of our research methodology to answer the research questions.
  • Figure 2: The Main Prompt Template for Review Comment Classification.
  • Figure 3: (RQ1-1) The distribution of human-written vs. LLM-generated code review comment types for Atlassian's internal projects.
  • Figure 4: (RQ1-2) The distribution of human-written vs. LLM-generated code review comment types for OSS projects.
  • Figure 5: (RQ2) The resolution rate (%) of LLM-generated comments for each comment type at Atlassian.
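Figures 3 to 5 report two kinds of percentages. One is the share of each comment type among human-written or LLM-generated comments. The other is the share of LLM-generated comments of each type that developers resolved. A minimal sketch of both calculations follows. It assumes each comment already carries an author field, a judge-assigned category, and a boolean `resolved` flag. How the paper decides that a comment was resolved is not described here, so that flag and the `ReviewComment` record are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    author: str     # hypothetical field: "human" or "llm"
    category: str   # label assigned by the LLM-as-a-Judge
    resolved: bool  # hypothetical flag: did a later code change address the comment?


def type_distribution(comments: list[ReviewComment], author: str) -> dict[str, float]:
    """Percentage of each comment type among one author group (cf. Figures 3 and 4)."""
    subset = [c for c in comments if c.author == author]
    counts = Counter(c.category for c in subset)
    # An empty subset yields an empty dict, so there is no division by zero.
    return {cat: 100.0 * n / len(subset) for cat, n in counts.items()}


def resolution_rates(comments: list[ReviewComment]) -> dict[str, float]:
    """Percentage of comments resolved, per category (cf. Figure 5)."""
    totals = Counter(c.category for c in comments)
    resolved = Counter(c.category for c in comments if c.resolved)
    return {cat: 100.0 * resolved[cat] / n for cat, n in totals.items()}
```

As a purely illustrative case, not data from the paper: four LLM-generated readability comments with three resolved, and four design comments with one resolved. `resolution_rates` gives 75.0 and 25.0, and `type_distribution(..., "llm")` gives 50.0 for each type. To mirror Figure 5, filter to LLM-generated comments before calling `resolution_rates`.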