Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings

Petr Tsvetkov; Aleksandra Eliseeva; Danny Dig; Alexander Bezzubov; Yaroslav Golubev; Timofey Bryksin; Yaroslav Zharov

TL;DR

This work tackles the challenge of evaluating commit message generation (CMG) by reconciling offline similarity metrics with online signals from real users. It introduces a framework that takes the online metric $ED(G,E)$, the edit distance between a generated message $G$ and its user-edited version $E$, as the reference signal and selects offline metrics $m$ by their correlation $Q(m)$ with it. A novel dataset with multiple $(G,E)$ pairs per commit is built from expert-labeled entries and synthetically extended ones, and a data-validation step confirms that the dataset resembles real usage. The key finding is that edit distance and its normalized form best predict online edits, while standard metrics such as BLEU, METEOR, and ROUGE show weak or negative correlations, underscoring a mismatch between traditional offline evaluation and the production user experience. The work releases public tools and data to enable fast, production-aligned offline CMG experimentation, with implications for more realistic evaluation of software engineering assistants.
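To make the selection criterion concrete, the following is a minimal sketch of how $ED(G,E)$ and a correlation score $Q(m)$ could be computed. It rests on several assumptions that are not stated in this summary: the edit distance is taken at character level, the normalization divides by the longer message length, the correlation is Spearman's rank correlation, and each offline metric $m$ is scored against a hypothetical reference message $R$. The names `edit_distance`, `normalized_edit_distance`, `spearman`, and `q` are illustrative and are not taken from the released code.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (an assumed granularity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (one common normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def q(metric: Callable[[str, str], float],
      samples: Sequence[Tuple[str, str, str]]) -> float:
    """Q(m) sketch: correlate online ED(G, E) with offline m(G, R).

    Each sample is a (G, E, R) triple; R is a hypothetical reference message.
    """
    online = [edit_distance(g, e) for g, e, _ in samples]
    offline = [metric(g, r) for g, _, r in samples]
    return spearman(online, offline)
```

Under these assumptions, a distance-like metric that tracks user edits well yields $Q(m)$ close to 1, while a similarity metric such as BLEU would be expected to yield a strongly negative value if it were equally predictive, so the sign has to be read relative to the metric's direction.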

Abstract

When a Commit Message Generation (CMG) system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, running online experiments for every change to a CMG system is costly, as each iteration affects users and requires time to collect enough statistics. Offline evaluation, the prevalent approach in the research literature, enables fast experiments but relies on automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe the approach we adopted at JetBrains to address this problem: we leverage an online metric, the number of edits users introduce before committing the generated messages to the VCS, to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset of 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how widely used similarity metrics correlate with the online metric reflecting real users' experience. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses of human labelers in controlled environments. We release all the code and the dataset to support future research in the field: https://jb.gg/cmg-evaluation.
