Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge

Sherry Shi, Renyao Wei, Michele Tufano, José Cambronero, Runxiang Cheng, Franjo Ivančić, Pat Rondon

TL;DR

The paper addresses an evaluation gap in Automated Program Repair (APR): execution-based metrics alone do not establish that a patch is valid. It proposes a human-in-the-loop framework in which an LLM generates a per-bug rubric, humans review and refine that rubric once, and an LLM judge then applies the refined rubric to decide patch validity. In a study on patches for sanitizer-detected bugs, the approach reaches substantial agreement with human consensus (Cohen's kappa 0.75 on patches with unanimous human labels, 0.57 on the full dataset), with recall of 0.93 to 0.94 and precision of 0.65 to 0.80. Ablation studies show that the rubric template and manual refinement are critical for reliable judgments, and that patches on which human raters disagree remain challenging. The framework aims to make offline APR evaluation scalable by reducing manual effort while staying aligned with expert judgment, and it points to future work on expanding rubric criteria, improving autonomy, and enabling multi-dimensional code quality assessment.

Abstract

Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement to this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94) and high precision (0.80), when considering patches that have unanimous agreement from 3 human raters on the validity labels. On the full dataset including patches where human raters disagree, we find this approach can still be further improved (Cohen's kappa 0.57, recall 0.93, precision 0.65) and identify possible future directions.
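The agreement figures quoted in the abstract (Cohen's kappa, precision, recall) can all be derived from a 2x2 confusion matrix between the human consensus label and the LLM judge's label. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `agreement_metrics` and the toy labels are illustrative assumptions, and "valid" is treated as the positive class.

```python
from typing import Dict, Sequence


def agreement_metrics(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compare binary validity labels (True = valid) from humans and an LLM judge.

    Hypothetical helper for illustration; "valid" is assumed to be the positive class.
    """
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    # Confusion-matrix counts, with the human label as ground truth.
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    n = tp + fp + fn + tn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 1.0

    return {"precision": precision, "recall": recall, "kappa": kappa}


# Toy data (not from the paper): 10 patches, 4 valid according to humans.
human_labels = [True, True, True, True, False, False, False, False, False, False]
judge_labels = [True, True, True, False, True, False, False, False, False, False]
metrics = agreement_metrics(human_labels, judge_labels)
print(metrics)
```

On this toy data the judge accepts one invalid patch and rejects one valid patch, so precision and recall are both 0.75, while kappa is lower than the raw 80% agreement because part of that agreement is expected by chance.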

Paper Structure

This paper contains 32 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of our two-stage framework. $\textcircled{1}$ Rubric Generation: An LLM first generates a per-bug rubric based on the bug's description and the ground truth patch. This rubric is reviewed and refined once by two human experts. $\textcircled{2}$ Patch Evaluation: The refined rubric is reused by an LLM judge to evaluate every patch for the same bug. The LLM judge outputs a binary validity label and a natural language justification for the label, enabling both quantitative and qualitative analyses.
  • Figure 2: CDF of normalized edit distance on 50 rubrics. The median normalized edit distance is 14% of the original content, and the maximum is 70%.
  • Figure 3: A 16-percentage-point drop from pass@20 to (pass & LLM-valid)@20, caused by the LLM judge rejecting patches that violate requirements in the rubric, fail to address the root cause, are incomplete, or introduce new issues.
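
The gap described in the Figure 3 caption can be illustrated with a small sketch. It assumes that a metric "@k" counts a bug as solved when at least one of its first k candidate patches meets the acceptance condition; the paper's exact definition may differ (for example, it could use an unbiased pass@k estimator). The names `PatchResult` and `rate_at_k`, and the toy data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PatchResult:
    """Outcome for one candidate patch (hypothetical structure)."""
    passes_tests: bool  # execution-based check succeeded
    llm_valid: bool     # LLM judge labeled the patch valid under the rubric


def rate_at_k(
    patches_per_bug: Dict[str, List[PatchResult]],
    k: int,
    accept: Callable[[PatchResult], bool],
) -> float:
    """Fraction of bugs with at least one accepted patch among the first k.

    Assumed definition for illustration only.
    """
    if not patches_per_bug:
        raise ValueError("need at least one bug")
    solved = sum(
        1 for patches in patches_per_bug.values()
        if any(accept(p) for p in patches[:k])
    )
    return solved / len(patches_per_bug)


# Toy data (not from the paper): four bugs with a few candidate patches each.
results = {
    "bug_a": [PatchResult(True, True)],
    "bug_b": [PatchResult(True, False)],  # passes tests, rejected by the judge
    "bug_c": [PatchResult(False, False)],
    "bug_d": [PatchResult(True, False), PatchResult(True, True)],
}

pass_at_k = rate_at_k(results, k=20, accept=lambda p: p.passes_tests)
pass_and_valid_at_k = rate_at_k(
    results, k=20, accept=lambda p: p.passes_tests and p.llm_valid
)
drop_in_points = (pass_at_k - pass_and_valid_at_k) * 100
print(pass_at_k, pass_and_valid_at_k, drop_in_points)
```

Here `bug_b` counts toward the execution-only metric but not toward the stricter one, which is the kind of case that produces the drop reported in the caption.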