An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luis F. Gomes, Guang Yang, Kui Liu, Xin Xia, David Lo
TL;DR
This work tackles the challenge of automatically evaluating the functional correctness of AI-generated SE artifacts by aligning automatic metrics with human judgments. It introduces SE-Jury, an LLM-as-Ensemble-Judge metric that defines five evaluation strategies, dynamically selects a team of judges, and ensembles their scores into a final assessment. Across code generation, automated program repair, and code summarization, SE-Jury correlates substantially better with human judgments than baseline metrics, and in code generation and program repair its agreement with human annotators is close to inter-annotator agreement. Dynamic team selection also reduces LLM usage. The framework offers a scalable alternative to human evaluation, with practical implications for a broad range of SE tasks and potential extensions to non-functional properties.
Abstract
Large Language Models (LLMs) and other automated techniques are increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On the one hand, human evaluation is highly accurate but labor-intensive and hard to scale. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first LLM-as-Ensemble-Judge evaluation metric designed specifically to assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges, which acts as a team and produces a final correctness score through ensembling. We evaluate SE-Jury on a diverse set of software engineering (SE) benchmarks spanning three popular SE tasks: code generation, automated program repair, and code summarization. Results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 29.6% to 140.8% over existing automatic metrics. In code generation and program repair, SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement. These findings underscore SE-Jury's potential as a scalable and reliable alternative to human evaluation in these SE tasks.
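To make the select-then-ensemble idea concrete, the following is a minimal sketch and not SE-Jury's actual implementation. The abstract does not specify how a team is chosen or how scores are combined, so everything here is an assumption for illustration: judges are represented only by their precomputed per-sample scores, the team is found by exhaustive search over judge subsets, the selection criterion is Pearson correlation with a small set of human-labeled scores, and ensembling is a plain average. The names `pearson`, `ensemble`, and `select_team` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def ensemble(scores_by_judge, team):
    """Average the per-sample scores of the judges in `team` (assumed ensembling rule)."""
    n = len(scores_by_judge[team[0]])
    return [mean(scores_by_judge[j][i] for j in team) for i in range(n)]


def select_team(scores_by_judge, human_scores):
    """Assumed selection rule: try every non-empty subset of judges and keep
    the one whose averaged scores correlate best with the human scores."""
    judges = sorted(scores_by_judge)
    best_team, best_corr = None, float("-inf")
    for size in range(1, len(judges) + 1):
        for team in combinations(judges, size):
            corr = pearson(ensemble(scores_by_judge, team), human_scores)
            if corr > best_corr:  # ties keep the smaller, earlier team
                best_team, best_corr = team, corr
    return best_team, best_corr


if __name__ == "__main__":
    # Toy scores from five hypothetical judges on six samples (made-up numbers).
    toy_scores = {
        "judge_1": [4, 1, 3, 0, 2, 4],
        "judge_2": [3, 0, 4, 1, 2, 3],
        "judge_3": [2, 2, 2, 3, 1, 2],
        "judge_4": [4, 0, 3, 0, 1, 4],
        "judge_5": [0, 4, 1, 3, 2, 0],
    }
    toy_human = [4, 0, 3, 1, 2, 4]
    team, corr = select_team(toy_scores, toy_human)
    print("selected team:", team, "correlation:", round(corr, 3))
```

With five judges the exhaustive search evaluates only 31 subsets, so it is cheap in this sketch. Because ties keep the smaller team, the selected team can be a strict subset of the five judges, which is consistent with the idea that selecting a team can reduce the number of LLM calls at evaluation time.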
