Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

Wendkûuni C. Ouédraogo; Yinghua Li; Xueqi Dang; Xin Zhou; Anil Koyuncu; Jacques Klein; David Lo; Tegawendé F. Bissyandé

TL;DR

Automated evaluation of LLM-based test refactorings is hampered by CodeBLEU’s bias toward surface similarity. The authors introduce CTSES, a composite metric that combines CodeBLEU, METEOR, and ROUGE-L as a weighted linear sum, $\mathrm{CTSES} = \alpha \cdot \mathrm{CodeBLEU} + \beta \cdot \mathrm{METEOR} + \gamma \cdot \text{ROUGE-L}$ with $\alpha + \beta + \gamma = 1$, and explore two practical weight profiles (CTSES1 and CTSES2). On 5,000+ refactorings from Defects4J and SF110 produced by GPT-4o and Mistral-Large, CTSES aligns more closely with human judgments and yields fewer false negatives than any of the individual metrics, while remaining interpretable through its tunable weights. The study also includes a human-centered validation on 15 refactorings, which shows that CodeBLEU often underestimates the benefit of readability and structural improvements. Future work aims to enrich CTSES with further dimensions such as naming quality and dynamic quality indicators. Overall, CTSES is a promising step toward human-aligned, composite evaluation of test refactorings that can be integrated into development pipelines.
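The weighted combination above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `ctses`, the assumption that all three input scores are already normalized to [0, 1], the requirement that weights be non-negative, and the example weights are all hypothetical. The actual CTSES1 and CTSES2 weight values are not stated in this summary and are not reproduced here.

```python
import math


def ctses(codebleu: float, meteor: float, rouge_l: float,
          alpha: float, beta: float, gamma: float) -> float:
    """Weighted linear combination of three precomputed similarity scores.

    Assumes each score lies in [0, 1] and that the weights are
    non-negative and sum to 1 (the sum constraint comes from the text;
    the other checks are assumptions of this sketch).
    """
    scores = {"codebleu": codebleu, "meteor": meteor, "rouge_l": rouge_l}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    if min(alpha, beta, gamma) < 0.0:
        raise ValueError("weights must be non-negative")
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("weights must sum to 1")

    return alpha * codebleu + beta * meteor + gamma * rouge_l


# Illustrative weights only; these are NOT the paper's CTSES1/CTSES2 profiles.
example = ctses(codebleu=0.40, meteor=0.80, rouge_l=0.70,
                alpha=0.5, beta=0.3, gamma=0.2)
print(f"CTSES = {example:.2f}")
```

With these illustrative inputs, a refactoring that scores low on CodeBLEU but high on METEOR and ROUGE-L receives a composite score above its CodeBLEU value, which is the mechanism by which a composite score can reduce false negatives.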

Abstract

Large Language Models (LLMs) are increasingly used to refactor unit tests, improving readability and structure while preserving behavior. Evaluating such refactorings, however, remains difficult: metrics like CodeBLEU penalize beneficial renamings and edits, while semantic similarities overlook readability and modularity. We propose CTSES, a first step toward human-aligned evaluation of refactored tests. CTSES combines CodeBLEU, METEOR, and ROUGE-L into a composite score that balances semantics, lexical clarity, and structural alignment. Evaluated on 5,000+ refactorings from Defects4J and SF110 (GPT-4o and Mistral-Large), CTSES reduces false negatives and provides more interpretable signals than individual metrics. Our emerging results illustrate that CTSES offers a proof-of-concept for composite approaches, showing their promise in bridging automated metrics and developer judgments.
