SBFT Tool Competition 2025 -- Java Test Case Generation Track

Fitsum Kifetew; Lin Yun; Davide Prandi

TL;DR

The SBFT 2025 Java Test Case Generation Track benchmarked EvoSuite, EvoFuzz, BBC, and Randoop on a fresh set of $55$ classes under test (CUTs) drawn from six open-source projects. The evaluation combines traditional structural metrics (code and mutation coverage) with an LLM-based readability assessment. BBC is the top tool for structural coverage, while EvoSuite and EvoFuzz score best on readability; the overall ranking, in which readability contributes $0.10$ of the total score, favors BBC. The final dataset, though reduced from the initial $955$ classes because of runtime errors, remains challenging and gives nuanced insight into tool effectiveness, especially when comparing EvoFuzz and BBC against Randoop. The work demonstrates the feasibility of LLM-assisted readability scoring and highlights differences among the EvoSuite-family tools, informing future improvements in Java unit test generation.

Abstract

This short report presents the 2025 edition of the Java Unit Testing Competition in which four test generation tools (EVOFUZZ, EVOSUITE, BBC, and RANDOOP) were benchmarked on a freshly selected set of 55 Java classes from six different open source projects. The benchmarking was based on structural metrics, such as code and mutation coverage of the classes under test, as well as on the readability of the generated test cases.
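
To make the role of the readability weight concrete, the following is a minimal sketch of how a structural score and a readability score might be folded into one total. Only the $0.10$ readability weight comes from the TL;DR. The linear form, the complementary $0.90$ weight on the structural part, the normalization of both inputs to $[0, 1]$, and the names `ScoreSketch` and `combine` are assumptions made for illustration; the competition's actual scoring formula may differ.

```java
/**
 * Illustrative sketch only: a hypothetical linear combination of a structural
 * score and a readability score. Not the competition's actual formula.
 */
public final class ScoreSketch {

    /** Share of the total score attributed to readability (stated in the TL;DR). */
    public static final double READABILITY_WEIGHT = 0.10;

    private ScoreSketch() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Combines two scores, each assumed to be normalized to [0, 1].
     * The structural part receives the remaining weight (an assumption).
     */
    public static double combine(double structuralScore, double readabilityScore) {
        if (!inUnitInterval(structuralScore) || !inUnitInterval(readabilityScore)) {
            throw new IllegalArgumentException("scores must lie in [0, 1]");
        }
        return (1.0 - READABILITY_WEIGHT) * structuralScore
                + READABILITY_WEIGHT * readabilityScore;
    }

    /** Returns true only for finite values in [0, 1]; NaN is rejected. */
    private static boolean inUnitInterval(double value) {
        return value >= 0.0 && value <= 1.0;
    }

    public static void main(String[] args) {
        // Made-up inputs purely for illustration; these are not competition results.
        System.out.println(combine(0.80, 0.50));
    }
}
```

Under these assumptions, a tool that is strong on structural metrics loses little from a weaker readability score, which is consistent with BBC leading the overall ranking even though EvoSuite and EvoFuzz score best on readability.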
