Simple and Effective Baselines for Code Summarisation Evaluation

Jade Robinson; Jonathan K. Kummerfeld

TL;DR

This paper introduces a simple LLM-based baseline for evaluating code summaries: an LLM is asked to assign an overall quality score, with access to the code being summarised. Because the score can be grounded in the code, a reference-free variant is also possible, which enables broader applications such as flagging low-quality documentation in codebases. The authors benchmark against standard n-gram and embedding-based metrics on two human-judged datasets, showing that the Ask-LLM baselines often outperform traditional metrics for overall quality and remain competitive with embedding-based metrics for similarity, while highlighting potential LLM biases. They further analyse prompting strategies, model dependence, costs, and language limitations, and recommend a hybrid evaluation approach that combines embedding-based metrics with Ask-LLM methods for robust, scalable assessment of code summarisation.

Abstract

Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This also allows us to make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.
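
The Ask-LLM idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the 1 to 5 scale, and the names `build_prompt`, `parse_score`, and `ask_llm_score` are all assumptions made for illustration, and `llm` stands for any caller-supplied function that maps a prompt string to a reply string. Leaving out the reference gives the reference-free variant.

```python
import re
from typing import Callable, Optional


def build_prompt(code: str, summary: str, reference: Optional[str] = None) -> str:
    """Assemble a scoring prompt (wording and 1-5 scale are illustrative assumptions)."""
    parts = [
        "Rate the overall quality of the summary of the code below "
        "on a scale from 1 to 5.",
        "Reply with a single integer.",
        "",
        "Code:",
        code,
        "",
    ]
    # The reference summary is optional: omitting it yields the reference-free variant.
    if reference is not None:
        parts += ["Reference summary:", reference, ""]
    parts += ["Summary to rate:", summary]
    return "\n".join(parts)


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the reply, or None if absent or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


def ask_llm_score(
    llm: Callable[[str], str],  # hypothetical: any function from prompt to reply text
    code: str,
    summary: str,
    reference: Optional[str] = None,
) -> Optional[int]:
    """Ask the model for one overall score and parse it from the reply."""
    return parse_score(llm(build_prompt(code, summary, reference)))
```

Passing the model as a plain callable keeps the sketch independent of any particular provider. Returning `None` for an unparseable reply makes failures visible to the caller instead of silently assigning a default score.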
