An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luis F. Gomes, Guang Yang, Kui Liu, Xin Xia, David Lo

TL;DR

This work tackles the challenge of automatically evaluating the functional correctness of AI-generated SE artifacts by aligning automatic metrics with human judgments. It introduces SE-Jury, an LLM-as-Ensemble-Judge metric that defines five evaluation strategies, dynamically selects a team of judges from them, and ensembles the team's scores into a final assessment. Across code generation, automated program repair, and code summarization, SE-Jury achieves substantially higher correlation with human judgments than baselines (improvements of 29.6% to 140.8% over existing automatic metrics) and approaches inter-annotator agreement in code generation and program repair, while reducing LLM usage via dynamic team selection. The framework offers a scalable alternative to human evaluation, with practical implications for a broad range of SE tasks and potential extensions to non-functional properties.

Abstract

Large Language Models (LLMs) and other automated techniques are increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On the one hand, human evaluation provides high accuracy but is labor-intensive and does not scale. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first LLM-as-Ensemble-Judge evaluation metric specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges as a team, whose scores are ensembled into a final correctness score. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks that span three popular SE tasks: code generation, automated program repair, and code summarization. Results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 29.6% to 140.8% over existing automatic metrics. SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement in code generation and program repair. These findings underscore SE-Jury's potential as a scalable and reliable alternative to human evaluation in these SE tasks.
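
To make the select-then-ensemble idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how the team is chosen or how scores are combined, so two assumptions are made here: the team is the subset of judges whose averaged score has the highest Pearson correlation with human scores on a small annotated sample, and the ensemble is a plain mean. The judge names (`strategy_1`, ...) and the toy scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def ensemble(scores: Dict[str, List[float]], team: Sequence[str]) -> List[float]:
    """Average the per-sample scores of the judges in `team` (assumed ensembling rule)."""
    n_samples = len(scores[team[0]])
    return [sum(scores[j][i] for j in team) / len(team) for i in range(n_samples)]


def select_team(
    scores: Dict[str, List[float]], human: Sequence[float]
) -> Tuple[Tuple[str, ...], float]:
    """Exhaustively search judge subsets for the best agreement with human scores.

    Assumed selection criterion: maximize Pearson correlation on an annotated sample.
    Smaller teams are tried first and ties keep the earlier team, so the search
    favors fewer LLM calls when agreement is equal.
    """
    judges = sorted(scores)
    best_team: Tuple[str, ...] = ()
    best_corr = float("-inf")
    for size in range(1, len(judges) + 1):
        for team in combinations(judges, size):
            corr = pearson(ensemble(scores, team), human)
            if corr > best_corr:
                best_team, best_corr = team, corr
    return best_team, best_corr


# Hypothetical toy data: three judges scoring five generated artifacts.
human_scores = [0, 1, 2, 3, 4]
judge_scores = {
    "strategy_1": [0, 1, 2, 3, 4],  # tracks the human scores
    "strategy_2": [4, 3, 2, 1, 0],  # inverted
    "strategy_3": [2, 0, 3, 1, 4],  # noisy
}
team, corr = select_team(judge_scores, human_scores)
final_scores = ensemble(judge_scores, team)
```

With five judges the exhaustive search covers only 31 subsets, so brute force is cheap in this setting; the selected team is then the only set of judges that needs to be queried on the remaining, unannotated samples.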

Paper Structure

This paper contains 25 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of SE-Jury.
  • Figure 2: Prompt Designs of Strategy 4 and Strategy 5.
  • Figure 3: Agreements between human developers ("H-H" and highlighted in blue) and agreements between SE-Jury and humans ("H-T" and highlighted in orange).