Evaluating Large Language Models for Code Review
Umut Cihan, Arda İçöz, Vahid Haratian, Eray Tüzün
TL;DR
The paper examines how reliable large language models are at automated code review by benchmarking GPT4o and Gemini 2.0 Flash on 492 AI-generated Python code blocks of varying correctness and 164 canonical code blocks from HumanEval, measuring correctness judgments and improvement suggestions under prompts with and without problem descriptions. With problem descriptions, GPT4o and Gemini 2.0 Flash classify correctness accurately 68.50% and 63.89% of the time on the 492 blocks; performance declines without descriptions and differs on the canonical set, indicating limited readiness for full automation. The authors propose a human-in-the-loop code review process that promotes knowledge sharing while mitigating the risk of faulty outputs. Overall, the work provides a replicable methodology for evaluating LLM-assisted code reviews, which practitioners can adapt to test on their own codebases, and points to practical ways of integrating AI reviews into software development workflows.
Abstract
Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs' performance in detecting code correctness and suggesting improvements. Method: We tested GPT4o and Gemini 2.0 Flash on 492 AI-generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported the results. Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a process that involves humans, called the "Human in the loop LLM Code Review", to promote knowledge sharing while mitigating the risk of faulty outputs.
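The two headline metrics in the Results (how often a model's correctness verdict matches the ground truth, and how often its revised code is correct) can be illustrated with a minimal sketch. The abstract does not spell out the exact scoring procedure, so everything here is an assumption made for illustration: the `ReviewOutcome` record, its field names, and in particular the choice to compute the correction rate only over code blocks that were originally incorrect.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReviewOutcome:
    """One reviewed code block (hypothetical record, not from the paper)."""
    is_correct: bool                   # ground truth: original code passes its tests
    judged_correct: bool               # the LLM's verdict on the original code
    fix_passes: Optional[bool] = None  # revised code passes tests; None if no fix was produced


def classification_accuracy(outcomes: Sequence[ReviewOutcome]) -> float:
    """Fraction of blocks where the LLM's verdict matches the ground truth."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    matches = sum(o.is_correct == o.judged_correct for o in outcomes)
    return matches / len(outcomes)


def correction_rate(outcomes: Sequence[ReviewOutcome]) -> float:
    """Fraction of originally incorrect blocks whose revised code passes.

    Assumption: only originally incorrect blocks count toward the
    denominator; the paper may define this metric differently.
    """
    faulty = [o for o in outcomes if not o.is_correct]
    if not faulty:
        raise ValueError("no incorrect blocks to score")
    fixed = sum(o.fix_passes is True for o in faulty)
    return fixed / len(faulty)


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    sample = [
        ReviewOutcome(is_correct=True, judged_correct=True),
        ReviewOutcome(is_correct=True, judged_correct=False),
        ReviewOutcome(is_correct=False, judged_correct=False, fix_passes=True),
        ReviewOutcome(is_correct=False, judged_correct=True, fix_passes=False),
    ]
    print(f"classification accuracy: {classification_accuracy(sample):.2%}")
    print(f"correction rate: {correction_rate(sample):.2%}")
```

Running the same scoring once for prompts with problem descriptions and once without would give the kind of side-by-side comparison the abstract reports.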
