Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge
Sherry Shi, Renyao Wei, Michele Tufano, José Cambronero, Runxiang Cheng, Franjo Ivančić, Pat Rondon
TL;DR
The paper addresses an evaluation gap in Automated Program Repair (APR): execution-based metrics alone do not establish whether a patch is valid. It proposes a human-in-the-loop framework in which an LLM generates a per-bug rubric, a human reviews and refines it once, and an LLM judge then applies the refined ("golden") rubric to label each patch as valid or invalid. In a large-scale study on patches for sanitizer-detected bugs, the rubric-guided judge reaches substantial agreement with human consensus on patches where all raters agree (Cohen's kappa 0.75, recall 0.94, precision 0.80); on the full dataset, which includes patches where raters disagree, agreement is lower (Cohen's kappa 0.57, recall 0.93, precision 0.65). Ablation studies show that the rubric template and manual refinement are both critical for reliable judgments, and that patches with human disagreement remain challenging. The framework supports scalable offline APR evaluation by reducing manual effort while staying aligned with expert judgments, and the authors point to future work on expanding rubric criteria, improving autonomy, and enabling multi-dimensional code quality assessment.
Abstract
Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement of this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94), and high precision (0.80) when considering patches that have unanimous agreement from 3 human raters on the validity labels. On the full dataset, which includes patches where human raters disagree, we find that this approach can still be further improved (Cohen's kappa 0.57, recall 0.93, precision 0.65), and we identify possible future directions.
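For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Cohen's kappa, precision, and recall can be computed when comparing binary judge labels against a human consensus label. It is an illustration only, not the authors' evaluation code: the function name `agreement_metrics`, the toy labels, and the choice of "valid" as the positive class are all assumptions made here.

```python
from typing import Dict, Sequence


def agreement_metrics(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compare judge labels against human consensus labels.

    Assumption: True means "patch is valid" and is treated as the positive class.
    """
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(human)
    tp = sum(1 for h, j in zip(human, judge) if h and j)          # both say valid
    fp = sum(1 for h, j in zip(human, judge) if not h and j)      # judge valid, human invalid
    fn = sum(1 for h, j in zip(human, judge) if h and not j)      # judge invalid, human valid
    tn = n - tp - fp - fn                                         # both say invalid

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 1.0

    return {"precision": precision, "recall": recall, "kappa": kappa}


# Toy data (hypothetical, not from the paper): 8 patches, 4 valid per the human consensus.
human_labels = [True, True, True, True, False, False, False, False]
judge_labels = [True, True, True, False, True, False, False, False]
metrics = agreement_metrics(human_labels, judge_labels)
print(metrics)
```

On this toy data the judge and the humans agree on 6 of 8 patches, while the label frequencies alone would give 50% agreement by chance, which is the baseline that kappa corrects for.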
