Commit Messages in the Age of Large Language Models

Cristina V. Lopes, Vanessa I. Klotzman, Iris Ma, Iftekar Ahmed

TL;DR

This study evaluates OpenAI's ChatGPT for automatic commit message generation from code diffs, comparing it to established ACMG baselines. Through a mixed-methods design using 108 diffs, qualitative judgments, and expert validation, ChatGPT demonstrates strong qualitative performance, generating informative messages with justifications that surpass prior models in many cases. NLP metrics show modest gains or mixed results, highlighting the need for human-centric evaluation in ACMG. The findings suggest that prompt engineering and interactive corrections can further enhance reliability, pointing toward practical integration of LLMs with developer workflows while acknowledging the necessity of guarding against context gaps and hallucinations.

Abstract

Commit messages are explanations of changes made to a codebase that are stored in version control systems. They help developers understand the codebase as it evolves. However, writing commit messages can be tedious and inconsistent among developers. To address this issue, researchers have tried using different methods to automatically generate commit messages, including rule-based, retrieval-based, and learning-based approaches. Advances in large language models offer new possibilities for generating commit messages. In this study, we evaluate the performance of OpenAI's ChatGPT for generating commit messages based on code changes. We compare the results obtained with ChatGPT to previous automatic commit message generation methods that have been trained specifically on commit data. Our goal is to assess the extent to which large pre-trained language models can generate commit messages that are both quantitatively and qualitatively acceptable. We found that ChatGPT was able to outperform previous Automatic Commit Message Generation (ACMG) methods by orders of magnitude, and that, generally, the messages it generates are both accurate and of high quality. We also provide insights into, and a categorization of, the cases where it fails.

Paper Structure (23 sections, 2 figures, 8 tables)