Simple and Effective Baselines for Code Summarisation Evaluation

Jade Robinson, Jonathan K. Kummerfeld

TL;DR

This paper introduces a simple LLM-based baseline for evaluating code summaries: an LLM is asked to assign an overall quality score, with explicit access to the code being summarised. A reference-free variant removes the need for a reference summary, enabling broader applications such as flagging low-quality documentation in codebases. The authors benchmark against standard n-gram and embedding-based metrics on two human-judged datasets, showing that the Ask-LLM baselines often outperform traditional metrics for overall quality and remain competitive with embeddings for similarity, while highlighting potential LLM biases. They further analyse prompting strategies, model dependence, costs, and language limitations, and recommend a hybrid evaluation approach that combines embedding-based metrics with Ask-LLM methods for robust, scalable code-summarisation assessment.

Abstract

Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.
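
The idea of asking an LLM for an overall score, with or without a reference summary, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the prompt wording, the 1 to 5 scale, and the names `build_prompt`, `parse_score`, and `ask_llm_score` are hypothetical, and `llm` stands for any caller-supplied function that maps a prompt string to a reply string. The prompt actually used in the paper (Figure 3) is not reproduced here.

```python
import re
from typing import Callable, Optional


def build_prompt(code: str, summary: str, reference: Optional[str] = None) -> str:
    """Assemble a scoring prompt (illustrative wording, not the paper's prompt).

    When `reference` is None, the prompt is reference-free: the LLM sees
    only the code and the summary being rated.
    """
    parts = [
        "Rate the overall quality of the summary of the code below",
        "on a scale from 1 (poor) to 5 (excellent). Reply with the number only.",
        "",
        "Code:",
        code,
    ]
    if reference is not None:
        parts += ["", "Reference summary:", reference]
    parts += ["", "Summary to rate:", summary, "", "Score:"]
    return "\n".join(parts)


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the reply, or None if absent or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


def ask_llm_score(
    llm: Callable[[str], str],
    code: str,
    summary: str,
    reference: Optional[str] = None,
) -> Optional[int]:
    """Build the prompt, query the supplied LLM callable, and parse its score."""
    return parse_score(llm(build_prompt(code, summary, reference)))
```

To try the sketch without a model, a stub such as `lambda prompt: "4"` can be passed as `llm`; in practice the callable would wrap whichever LLM client is in use. Returning `None` for unparseable replies keeps malformed outputs from being silently counted as scores.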

Paper Structure

This paper contains 102 sections, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Correlation with Adequacy by Reference Quality on the haque2022 dataset
  • Figure 2: Example from roy2021 (Note: this is a particularly short example)
  • Figure 3: Ask LLM Directly Final Prompt
  • Figure 4: Question Answering Prompt for Question Generation Step
  • Figure 5: Prompt given to Claude for Summary Generation
  • ...and 2 more figures