Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings

Petr Tsvetkov, Aleksandra Eliseeva, Danny Dig, Alexander Bezzubov, Yaroslav Golubev, Timofey Bryksin, Yaroslav Zharov

TL;DR

This work addresses the evaluation of commit message generation (CMG) by reconciling offline similarity metrics with online signals from real users. It introduces a framework that takes the online metric $ED(G,E)$, the edit distance between a generated commit message $G$ and its edited version $E$, as the reference signal, and selects each offline metric $m$ by its correlation $Q(m)$ with that signal. A novel dataset with multiple $(G,E)$ pairs per commit is built from expert-labeled and synthetically extended entries, and a data-validation step confirms that the dataset resembles real usage. The key finding is that edit distance and its normalized form best predict online edits, while standard metrics such as BLEU, METEOR, and ROUGE show weak or negative correlations, which points to a mismatch between traditional offline evaluation and the experience of users in production. The work releases public tools and data that enable fast, production-aligned offline CMG experiments, with implications for more realistic evaluation of software engineering assistants.

Abstract

When a Commit Message Generation (CMG) system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers within controlled environments. We release all the code and the dataset to support future research in the field: https://jb.gg/cmg-evaluation.
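
The selection procedure described in the abstract can be illustrated with a minimal Python sketch. It computes a character-level edit distance between a generated message and its edited version, and then scores an offline metric by how well its values correlate with those online edit distances. Several details here are assumptions made for illustration rather than the authors' implementation: the function names (`edit_distance`, `normalized_edit_distance`, `spearman`, `metric_quality`) are hypothetical, normalization by the longer string's length is only one common choice, and Spearman rank correlation is used as one plausible correlation measure.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    vx = sum((p - mx) ** 2 for p in rx)
    vy = sum((q - my) ** 2 for q in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (vx * vy) ** 0.5


def metric_quality(offline_scores: Sequence[float],
                   online_edit_distances: Sequence[float]) -> float:
    """Hypothetical stand-in for Q(m): correlation of an offline metric's
    scores with the online edit distances observed for the same samples."""
    return spearman(offline_scores, online_edit_distances)
```

Under these assumptions, an offline metric whose `metric_quality` is closest to 1 in absolute value would be the preferred proxy for the online signal. How offline scores are paired with online edit distances for each commit is left to the caller in this sketch.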

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: We propose to collect multiple pairs of generated messages $G$ and their edited versions $E$ for each commit $C$ to find an offline metric (teal lines on the scheme) that estimates the online metric (magenta lines on the scheme) as closely as possible. This allows us to select the model that will show the best quality when deployed and evaluated online (dotted lines) while evaluating it in the offline setting (dashed line).
  • Figure 2: Overview of our dataset collection process.
  • Figure 3: A screenshot of our web application for collecting commit message edits. On the left, the assessors are presented with the code changes from the current commit in diff format. On the right, there is a window with a model-generated commit message to be edited; all changes in this window are tracked. Additionally, there is a Help toggle at the top with the labeling instructions and a Commit summary toggle with extra information about the current project, intended to provide more context for the code changes. The web application is publicly available [cmg-web-application].
  • Figure 4: Distribution of the $ED(G,E)$ values for the different subsets of our dataset and for the PyCharm user logs collected over one month in April-May 2024, where $ED(G,E)$ is the edit distance between model-generated messages $G$ and their edited versions $E$. Note that we scale the $ED$ values from the PyCharm logs to adjust for differences in message lengths and discard samples with an edit distance of 0.
  • Figure 5: Overview of the relations between different $(G,E)$ pairs in our dataset. Pairs connected by lines are related pairs. Any pair $(G,E)$ for a given commit $C$ that is not related is conditionally independent. We use related pairs to calculate online metrics and conditionally independent pairs to calculate offline metrics.