Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

Haolin Jin, Huaming Chen

TL;DR

A systematic failure of LLMs in matching code to natural language requirements is uncovered, and a Fix-guided Verification Filter is proposed that treats the model-proposed fix as executable counterfactual evidence and validates the original and revised implementations using benchmark tests and spec-constrained augmented tests.

Abstract

Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether a code implementation satisfies the task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably judge code against the given task descriptions, which usually take the form of natural language specifications. In this paper, we uncover a systematic failure of LLMs in matching code to natural language requirements. Specifically, using widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompts, particularly those requiring explanations and proposed corrections, lead to higher misjudgment rates, highlighting critical reliability issues for LLM-based code assistants. We further analyze the mechanisms driving these failures and evaluate the reliability of rationale-required judgments. Building on these findings, we propose a Fix-guided Verification Filter that treats the model-proposed fix as executable counterfactual evidence and validates the original and revised implementations using benchmark tests and spec-constrained augmented tests. Our results expose previously under-explored limitations in LLM-based code review capabilities and provide practical guidance for integrating LLM-based reviewers with safeguards in automated review and development pipelines.
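
To make the idea of a fix-guided filter concrete, the following is a minimal sketch rather than the authors' implementation. The abstract only states that the original and the model-proposed fix are both run against benchmark tests and spec-constrained augmented tests; the decision rule, the function names (`passes_all`, `fix_guided_filter`), the verdict labels, and the test-case format are all assumptions introduced here for illustration.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical test-case format: (positional arguments, expected output).
TestCase = Tuple[tuple, object]


def passes_all(func: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True if func yields the expected output on every test case."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash on any test counts as a failure.
            return False
    return True


def fix_guided_filter(
    original: Callable,
    proposed_fix: Callable,
    benchmark_tests: Iterable[TestCase],
    augmented_tests: Iterable[TestCase],
) -> str:
    """Assumed decision rule for a reviewer verdict of 'defective'.

    - "overturn": the original passes every test, so the verdict is
      treated as a likely overcorrection.
    - "uphold": the original fails some test and the proposed fix passes all.
    - "inconclusive": neither version passes all tests.
    """
    tests = list(benchmark_tests) + list(augmented_tests)
    if passes_all(original, tests):
        return "overturn"
    if passes_all(proposed_fix, tests):
        return "uphold"
    return "inconclusive"


# Illustrative usage with a toy "absolute value" requirement.
def original_abs(x):
    return x if x >= 0 else -x


def proposed_abs(x):
    # A reviewer "fix" that silently truncates floats.
    return abs(int(x))


benchmark = [((3,), 3), ((-2,), 2)]
augmented = [((-1.5,), 1.5)]  # extra case that stays within the stated spec

verdict = fix_guided_filter(original_abs, proposed_abs, benchmark, augmented)
print(verdict)  # overturn
```

In this sketch the proposed fix serves as counterfactual evidence only in the sense that its test outcomes are compared against those of the original; how the paper weighs the two outcomes, and how the augmented tests are constrained by the specification, is not described in the abstract.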

Paper Structure (46 sections, 1 equation, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Workflow of evaluating LLM code conformance on canonical and buggy solutions under three prompting modes, with outputs logged for downstream scoring.
  • Figure 2: The Full prompt template used in our experiments. It provides the requirement and code, then requests a three-step response.
  • Figure 3: Absolute FN and FP counts across prompt settings for GPT-4o and Llama-3.1-8B.
  • Figure 4: Distribution of perceived fault types in FN rationales (left) and top-4 perceived errors by model (right). Categories include Misread_Spec (spec misinterpretation), Added_Requirement (unstated constraints), Overthink_Edge (boundary overconcern), Assumed_Type (format assumption), Imagined_Runtime (unsupported runtime speculation), Perf_Nitpick (performance critique), Read_Nitpick (style critique), Precision_Error (numeric precision), Logic_Error (algorithmic flaw claim), and Vague_Description (vague unsupported reasoning).
  • Figure 5: Counts of inconsistent rationales (labeled as contradiction or unclear by a GPT-4o evaluator) under rationale-enabled prompts. Bars compare Direct+Explain vs. Full for each model across HumanEval, MBPP, and QuixBugs and report absolute counts.
  • ...and 4 more figures