CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model

Tae-Hwan Jung

TL;DR

CommitBERT tackles automatic commit message generation from code changes. It builds a six-language dataset of 345K pairs of code modifications (added and deleted lines) and commit messages, and trains a Transformer-based encoder-decoder initialized with CodeBERT, with an additional Code-to-NL pretraining step to bridge PL and NL representations. Feeding only the changed lines to the encoder and starting from Code-to-NL pretrained weights both improve BLEU-4 over the baselines, which supports the value of domain-adapted pretraining for code-to-NL tasks. The work enables practical assistive tooling for developers and points to AST-based representations as future work to better capture code structure.

Abstract

A commit message is a document that summarizes source code changes in natural language. A good commit message clearly describes the source code changes, which improves collaboration between developers. Our work therefore aims to develop a model that writes the commit message automatically. To this end, we release a dataset of 345K pairs of code modifications and commit messages in six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby). As in a neural machine translation (NMT) model, we feed the code modification to the encoder and the commit message to the decoder, and we evaluate the generated commit messages with BLEU-4. We also propose two training methods that improve the generated commit messages: (1) a method of preprocessing the code modification before it is fed to the encoder; (2) a method that uses initial weights suited to the code domain to reduce the gap in contextual representation between programming language (PL) and natural language (NL). Training code, dataset, and pre-trained weights are available at https://github.com/graykode/commit-autosuggestions
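
To make the evaluation metric concrete, the following is a minimal sketch of a sentence-level BLEU-4 score between a reference commit message and a generated one. It is a simplification for illustration, not the paper's evaluation script: whitespace tokenization, a single reference, and add-one smoothing on the 2- to 4-gram precisions are all assumptions, and the function names `ngrams` and `bleu4` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(reference, hypothesis):
    """Simplified sentence-level BLEU-4 (assumed variant, single reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a hypothesis n-gram is credited at most as many
        # times as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if matches == 0:
                return 0.0  # no word overlap at all
            precision = matches / total
        else:
            # Add-one smoothing (an assumption) keeps short messages from
            # collapsing to zero when a higher-order n-gram is missing.
            precision = (matches + 1) / (total + 1)
        log_precision_sum += math.log(precision) / 4  # uniform weights
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum)


identical = bleu4("fix typo in readme file", "fix typo in readme file")
partial = bleu4("fix typo in readme file", "fix typo in docs")
disjoint = bleu4("fix typo in readme file", "bump version number")
```

Here `identical` is a perfect score, `disjoint` is zero because no word overlaps, and `partial` falls in between. Corpus-level BLEU, as usually reported in papers, aggregates n-gram counts over all examples rather than averaging per-sentence scores.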

Paper Structure

This paper contains 17 sections, 2 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: An example of a commit message and git diff on GitHub. Git diff uses the unified format (unidiff): lines marked in red or green are modified lines. Green-highlighted '+' lines are added code, whereas red-highlighted '-' lines are deleted code.
  • Figure 2: Commit message verb types and their frequencies. 'upgrade' is the only verb that is not among the high-frequency verbs; it is included because it is used similarly to 'update'. The verb group follows jiang2017towards.
  • Figure 3: Illustration of a code modification example in git diff (a) and how it is turned into the input of CommitBERT (b). (b) shows that not all lines of the code modification in (a) are used: only the changed lines are taken as input. In this example, the code modification (a) includes 'return a - b', but the model input (b) does not.
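
The preprocessing described in Figure 3 can be sketched as follows: a unified diff is split into its added and deleted lines, and unchanged context lines are dropped. This is a minimal sketch, not the repository's actual preprocessing code. The function name `split_changed_lines` and the example diff are hypothetical, and how the two line groups are then tokenized and joined for the encoder is not shown here.

```python
def split_changed_lines(diff_text):
    """Split a unified diff into (added, deleted) lines, dropping context."""
    added, deleted = [], []
    for line in diff_text.splitlines():
        # Skip file headers ('--- a/...', '+++ b/...') and hunk headers.
        # Simplification: a changed line whose content itself starts with
        # '--' or '++' would also be skipped by this check.
        if line.startswith(("+++", "---", "@@")):
            continue
        if line.startswith("+"):
            added.append(line[1:].strip())
        elif line.startswith("-"):
            deleted.append(line[1:].strip())
        # Anything else (context lines starting with a space, 'diff --git',
        # 'index ...') is unchanged code or metadata and is ignored.
    return added, deleted


# Hypothetical diff: 'return a - b' is only a context line here.
example_diff = "\n".join([
    "diff --git a/calc.py b/calc.py",
    "--- a/calc.py",
    "+++ b/calc.py",
    "@@ -1,5 +1,5 @@",
    " def sub(a, b):",
    "     return a - b",
    " ",
    " def add(a, b):",
    "-    return a * b",
    "+    return a + b",
])

added_lines, deleted_lines = split_changed_lines(example_diff)
```

In this example the context line `return a - b` appears in the diff but in neither output list, which mirrors the situation the Figure 3 caption describes.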