Assertion-Aware Test Code Summarization with Large Language Models

Anamul Haque Mollah, Ahmed Aljohani, Hyunsook Do

TL;DR

The paper addresses the problem that unit-test methods often lack concise, developer-aligned summaries and proposes a benchmark of 91 real-world Java tests paired with developer-written summaries. It systematically ablates seven prompt variants and evaluates four code LLMs (Codex, Codestral, DeepSeek, Qwen-Coder) using multiple metrics including BLEU, ROUGE-L, METEOR, BERTScore, and LLM-Eval. The results show that assertion-level context—especially assertion messages with semantics—consistently improves summary quality and can match or exceed using the full method-under-test context, with Codex and Qwen-Coder achieving the strongest human-aligned performance. The work provides practical guidance for prompt design in test summarization and supplies a replication package to enable future benchmarking and tooling for developer documentation.

Abstract

Unit tests often lack concise summaries that convey test intent, especially in auto-generated or poorly documented codebases. Large Language Models (LLMs) offer a promising solution, but their effectiveness depends heavily on how they are prompted. Unlike generic code summarization, test-code summarization poses distinct challenges because test methods validate expected behavior through assertions rather than implementing functionality. This paper presents a new benchmark of 91 real-world Java test cases paired with developer-written summaries and conducts a controlled ablation study to investigate how test-code components, such as the method under test (MUT), assertion messages, and assertion semantics, affect the performance of LLM-generated test summaries. We evaluate four code LLMs (Codex, Codestral, DeepSeek, and Qwen-Coder) across seven prompt configurations using n-gram metrics (BLEU, ROUGE-L, METEOR), semantic similarity (BERTScore), and LLM-based evaluation. Results show that prompting with assertion semantics improves summary quality by an average of 0.10 points (2.3%) over full MUT context (4.45 vs. 4.35) while requiring fewer input tokens. Codex and Qwen-Coder achieve the highest alignment with human-written summaries, while DeepSeek underperforms despite high lexical overlap. The replication package is publicly available at https://doi.org/10.5281/zenodo.17067550
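
The reported gain can be checked directly from the two scores quoted in the abstract. The snippet below is only a sanity check of that arithmetic; the variable names are illustrative, and the metric and its scale are not restated in this excerpt.

```python
# Average summary-quality scores as quoted in the abstract.
# Which metric and scale they belong to is not restated here.
assertion_semantics_score = 4.45
full_mut_score = 4.35

# Absolute gain in points, and gain relative to the full-MUT baseline.
absolute_gain = assertion_semantics_score - full_mut_score
relative_gain_pct = 100 * absolute_gain / full_mut_score

print(f"absolute gain: {absolute_gain:.2f} points")  # absolute gain: 0.10 points
print(f"relative gain: {relative_gain_pct:.1f}%")    # relative gain: 2.3%
```

The percentage is relative to the full-MUT score (0.10 / 4.35), not to the maximum of the scale.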

Paper Structure

This paper contains 14 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overall methodology for test code summarization. We parse Java test methods and extract their developer-written comments, assertion statements, and corresponding MUT. Each assertion is processed by an LLM to generate its semantic meaning. These parsed and enriched components then serve to construct structured prompts for our ablation study. Finally, we evaluate the quality of the generated summaries using standard text similarity and LLM-based metrics.
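
The pipeline in Figure 1 can be illustrated with a rough sketch. This is not the authors' implementation: the regex-based extraction, the prompt wording, the section labels, and every function and parameter name below are assumptions made for illustration. The paper's seven prompt configurations are not listed in this excerpt, so the flags here only show how components might be switched on and off in an ablation. Assertion semantics are passed in as ready-made strings, where the paper generates them with an LLM.

```python
import re
from typing import List, Optional

# Hypothetical patterns: JUnit-style assertion calls and Java string literals.
ASSERT_CALL = re.compile(r"\b(assert[A-Z]\w*|fail)\s*\(")
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Illustrative instruction text, not the prompt used in the paper.
INSTRUCTION = "Summarize the intent of this Java unit test in one sentence."


def extract_assertions(test_source: str) -> List[str]:
    """Return each assertion call as text by scanning to its closing parenthesis.

    Simplification: Java char literals such as ')' and comments are not handled.
    """
    assertions = []
    for match in ASSERT_CALL.finditer(test_source):
        depth, i, in_string = 0, match.end() - 1, False  # i starts at "("
        while i < len(test_source):
            ch = test_source[i]
            if in_string:
                if ch == "\\":
                    i += 1  # skip the escaped character
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    assertions.append(test_source[match.start(): i + 1])
                    break
            i += 1
    return assertions


def build_prompt(
    test_source: str,
    mut_source: Optional[str] = None,
    include_messages: bool = False,
    assertion_semantics: Optional[List[str]] = None,
) -> str:
    """Assemble a prompt from the test method plus optional context components."""
    parts = [INSTRUCTION, "", "Test method:", test_source.strip()]
    if mut_source:
        parts += ["", "Method under test:", mut_source.strip()]
    if include_messages:
        # Heuristic: treat every string literal inside an assertion as a message.
        messages = [
            literal
            for assertion in extract_assertions(test_source)
            for literal in STRING_LITERAL.findall(assertion)
        ]
        if messages:
            parts += ["", "Assertion messages:"] + [f"- {m}" for m in messages]
    if assertion_semantics:
        parts += ["", "Assertion semantics:"] + [f"- {s}" for s in assertion_semantics]
    return "\n".join(parts)


# Made-up test method used only to exercise the sketch.
EXAMPLE_TEST = '''
@Test
public void testNewStackIsEmpty() {
    Stack<Integer> stack = new Stack<>();
    assertTrue("new stack should be empty", stack.isEmpty());
    assertEquals("size should be zero", 0, stack.size());
}
'''

print(build_prompt(EXAMPLE_TEST, include_messages=True))
```

A real pipeline would more likely use a Java parser than regular expressions, since the string-literal heuristic also picks up expected values such as `assertEquals("abc", actual)` that are not messages.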