Can Large Language Models Serve as Evaluators for Code Summarization?
Yang Wu, Yao Wan, Zhaoyang Chu, Wenting Zhao, Ye Liu, Hongyu Zhang, Xuanhua Shi, Philip S. Yu
TL;DR
The paper addresses the challenge of evaluating code summaries by proposing CodeRPE, an LLM-based evaluator that uses multi-role prompting to assess coherence, consistency, fluency, and relevance. It demonstrates that CodeRPE, including reference-free variants, aligns closely with human judgments (up to 81.59% Spearman correlation) and outperforms the traditional metric BERTScore by 17.27%. The study analyzes prompting strategies (chain-of-thought, in-context learning, rating forms) and shows that ChatGPT often delivers the strongest, most human-aligned evaluations, while noting costs and design considerations. Overall, CodeRPE offers a scalable, competitive alternative to human evaluation for code summarization and provides guidance on prompt design and role assignment in LLM-based evaluators.
Abstract
Code summarization facilitates program comprehension and software maintenance by converting code snippets into natural-language descriptions. Over the years, numerous methods have been developed for this task, but a key challenge remains: effectively evaluating the quality of generated summaries. While human evaluation is effective for assessing code summary quality, it is labor-intensive and difficult to scale. Commonly used automatic metrics, such as BLEU, ROUGE-L, METEOR, and BERTScore, often fail to align closely with human judgments. In this paper, we explore the potential of Large Language Models (LLMs) for evaluating code summarization. We propose CODERPE (Role-Player for Code Summarization Evaluation), a novel method that leverages role-player prompting to assess the quality of generated summaries. Specifically, we prompt an LLM agent to play diverse roles, such as code reviewer, code author, code editor, and system analyst. Each role evaluates the quality of code summaries across key dimensions, including coherence, consistency, fluency, and relevance. We further explore the robustness of LLMs as evaluators by employing various prompting strategies, including chain-of-thought reasoning, in-context learning, and tailored rating form designs. The results demonstrate that LLMs serve as effective evaluators for code summarization methods. Notably, our LLM-based evaluator, CODERPE, achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
