Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond

Pengyu Xue; Linhao Wu; Zhongxing Yu; Zhi Jin; Zhen Yang; Xinyi Li; Zhenyu Yang; Yue Tan


TL;DR

This paper presents the first comprehensive empirical study of how well LLMs can generate high-quality commit messages and how to push beyond their current performance. It also proposes ERICommiter, an Efficient Retrieval-based In-Context Learning (ICL) framework that uses two-step filtering to speed up retrieval.

Abstract

Commit Message Generation (CMG) approaches aim to automatically generate commit messages from given code diffs; such messages facilitate collaboration among developers and play a critical role in Open-Source Software (OSS). Very recently, Large Language Models (LLMs) have demonstrated extensive applicability in diverse code-related tasks, but few studies have systematically explored their effectiveness on CMG. This paper conducts the first comprehensive experiment to investigate how far we have come in applying LLMs to generate high-quality commit messages. Motivated by a pilot analysis, we first clean the most widely used CMG dataset following practitioners' criteria. Afterward, we re-evaluate diverse state-of-the-art CMG approaches and compare them with LLMs, demonstrating the superior performance of LLMs over state-of-the-art CMG approaches. We then propose four manual metrics following OSS practice, namely Accuracy, Integrity, Applicability, and Readability, and assess various LLMs accordingly. Results reveal that GPT-3.5 performs best overall, but different LLMs carry different advantages. To further boost LLMs' performance on the CMG task, we propose an Efficient Retrieval-based In-Context Learning (ICL) framework, namely ERICommiter, which leverages two-step filtering to accelerate retrieval and introduces semantic/lexical-based retrieval algorithms to construct the ICL examples. Extensive experiments demonstrate that ERICommiter substantially improves the performance of various LLMs on code diffs of different programming languages. Meanwhile, ERICommiter also significantly reduces the retrieval time while keeping almost the same performance. Our research contributes to the understanding of LLMs' capabilities in the CMG field and provides valuable insights for practitioners seeking to leverage these tools in their workflows.
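To make the idea of retrieval-based ICL concrete, here is a minimal Python sketch of the general pattern: narrow a corpus of past commits with a cheap filter, rank the survivors by similarity to the query diff, and paste the top matches into the prompt as examples. It is not ERICommiter's implementation. The abstract does not specify the two filtering steps or the retrieval algorithms, so the file-extension filter, the Jaccard token-overlap score, and all function names (`retrieve_examples`, `build_prompt`, and so on) are assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def file_extensions(diff):
    """Collect file extensions from 'diff --git a/... b/...' header lines."""
    exts = set()
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            filename = line.split()[-1].rsplit("/", 1)[-1]
            if "." in filename:
                exts.add(filename.rsplit(".", 1)[-1])
    return exts


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap of two token lists (an assumed lexical similarity)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve_examples(query_diff, corpus, k=2):
    """Return the k corpus entries most similar to query_diff.

    Assumed filter: keep only entries that share a file extension with the
    query, so the costlier similarity scoring runs on fewer candidates.
    """
    query_exts = file_extensions(query_diff)
    candidates = [ex for ex in corpus if file_extensions(ex["diff"]) & query_exts]
    query_tokens = tokenize(query_diff)
    ranked = sorted(
        candidates,
        key=lambda ex: jaccard(query_tokens, tokenize(ex["diff"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_diff, examples):
    """Assemble a few-shot prompt: retrieved (diff, message) pairs, then the query."""
    parts = [
        f"Diff:\n{ex['diff']}\nCommit message: {ex['message']}\n" for ex in examples
    ]
    parts.append(f"Diff:\n{query_diff}\nCommit message:")
    return "\n".join(parts)
```

Each corpus entry is assumed to be a dict with `diff` and `message` keys. The string returned by `build_prompt` would then be sent to whichever LLM is under evaluation. A semantic variant would replace `jaccard` with a similarity over learned embeddings.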

Paper Structure (31 sections, 10 figures, 7 tables)

This paper contains 31 sections, 10 figures, 7 tables.

Introduction
Related Works
Commit Message Generation
What Is A Good Commit Message
Large Language Models on Code
Pilot Analysis
Experimental design
Data set preparation
Experimental Model Preparation
Automated assessment
Experimental results and analysis
Construction of the high-quality test set
Study Design
Models and Implementations
Evaluation Methodology
...and 16 more sections

Figures (10)

Figure 1: The Percentage (%) of Containing/Missing "What"/"Why" Elements in Comparison between GPT-3.5-Generated Commit Messages and Ground Truths.
Figure 2: Manual Assessment Results in Terms of Accuracy, Integrity, Readability, and Applicability in Order.
Figure 3: A Python Example of Generated Commit Messages on Accuracy.
Figure 4: A C# Example of Generated Commit Messages on Integrity/Applicability.
Figure 5: A Java Example of Generated Commit Messages on Readability.
...and 5 more figures
