KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation

Wei Tao, Yucheng Zhou, Yanlin Wang, Hongyu Zhang, Haofen Wang, Wenqiang Zhang

TL;DR

KADEL proposes a knowledge-aware denoising framework for commit message generation that leverages good-practice commits to learn a commit knowledge model. The model predicts commit type and scope from code changes, and a dynamic EM-based denoising strategy reweights training losses to mitigate noise when transferring knowledge to the full dataset. Empirical results on MCMD show KADEL achieving state-of-the-art performance across multiple programming languages and data-splitting schemes, with strong human-evaluation support. The approach highlights the practical value of distilling community best practices into learning systems and suggests pathways for applying similar ideas to other software-engineering tasks and LLM-based methods.
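To make "commit type and scope" concrete, the following minimal sketch extracts both fields from a commit header. It assumes good-practice commits use a `type(scope): subject` header with an optional scope; that format, the regular expression, and the names `parse_header`, `is_good_practice`, and `CommitKnowledge` are illustrative assumptions and are not taken from the paper.

```python
import re
from typing import NamedTuple, Optional

# Assumed header shape: "type(scope): subject", where "(scope)" is optional.
# This pattern is a hypothetical stand-in, not the paper's filtering rule.
HEADER_RE = re.compile(
    r"^(?P<ctype>[a-z]+)(?:\((?P<scope>[^()]+)\))?:\s+(?P<subject>.+)$"
)


class CommitKnowledge(NamedTuple):
    commit_type: str
    scope: Optional[str]
    subject: str


def parse_header(message: str) -> Optional[CommitKnowledge]:
    """Return the parsed header, or None if the first line does not match."""
    lines = message.strip().splitlines()
    if not lines:
        return None
    match = HEADER_RE.match(lines[0].strip())
    if match is None:
        return None
    return CommitKnowledge(match["ctype"], match["scope"], match["subject"])


def is_good_practice(message: str) -> bool:
    """Treat a message as good practice when its header parses (assumption)."""
    return parse_header(message) is not None
```

Under these assumptions, messages that parse could form the subset used to train a knowledge model, while messages that do not parse would receive predicted type and scope labels instead.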

Abstract

Commit messages are natural-language descriptions of code changes and are important for software evolution tasks such as code understanding and maintenance. However, previous methods are trained on the entire dataset without considering that only a portion of commit messages adhere to good practice (i.e., good-practice commits), while the rest do not. Our empirical study shows that training on good-practice commits contributes significantly to commit message generation. Motivated by this finding, we propose a novel knowledge-aware denoising learning method called KADEL. Because good-practice commits constitute only a small proportion of the dataset, we align the remaining training samples with these good-practice commits. To achieve this, we propose a model that learns commit knowledge by training on good-practice commits. This knowledge model supplements training samples that do not conform to good practice with additional information. However, since the supplementary information may contain noise or prediction errors, we propose a dynamic denoising training method. This method consists of a distribution-aware confidence function and a dynamic distribution list, which enhances the effectiveness of the training process. Experimental results on the whole MCMD dataset demonstrate that our method overall achieves state-of-the-art performance compared with previous methods. Our source code and data are available at https://github.com/DeepSoftwareAnalytics/KADEL
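The idea of down-weighting noisy training samples can be sketched as follows. This is not the paper's confidence function or its EM procedure, neither of which is specified in this section; it is a simplified stand-in that keeps a rolling list of recent losses and shrinks the weight of samples whose loss is unusually high relative to that list. The class `LossReweighter`, the `window` and `alpha` parameters, and the exponential decay rule are all hypothetical.

```python
import math
from collections import deque
from statistics import fmean, pstdev


class LossReweighter:
    """Down-weight samples whose loss is an outlier in the recent distribution."""

    def __init__(self, window: int = 1000, alpha: float = 1.0) -> None:
        # Rolling record of recent losses (a stand-in for a "distribution list").
        self.history = deque(maxlen=window)
        # Hypothetical sharpness of the penalty on high-loss samples.
        self.alpha = alpha

    def confidence(self, loss: float) -> float:
        """Return a weight in (0, 1]; 1.0 until enough history is available."""
        if len(self.history) < 2:
            return 1.0
        mean = fmean(self.history)
        std = pstdev(self.history)
        if std == 0.0:
            return 1.0
        z = (loss - mean) / std
        # Losses at or below the mean keep full weight; higher ones decay.
        return math.exp(-self.alpha * max(0.0, z))

    def weighted(self, loss: float) -> float:
        """Scale the loss by its confidence, then record it in the history."""
        scaled = self.confidence(loss) * loss
        self.history.append(loss)
        return scaled
```

In a training loop, `weighted` would be applied to the per-sample loss of samples whose type and scope were predicted rather than observed, so that likely mispredictions contribute less to the gradient.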

Paper Structure (34 sections, 9 equations, 10 figures, 15 tables)

Figures (10)

  • Figure 1: An example of code change and the corresponding commit message from GitHub.
  • Figure 2: The three sub-figures in the upper row show the decoder attention weights of the model trained with "type" and "scope" in the decoder output. The lower row shows the cross-attention weights of the model trained with "type" and "scope" in the encoder input. Rectangular blocks of different colors represent different attention heads. The darker the color, the higher the attention weight.
  • Figure 3: The decoder attention weights of the model trained on the setting of decoder output containing "type" and "scope". The darker the color, the higher the attention weight.
  • Figure 4: Training-loss distribution across epochs for our method (upper left) and our method without denoising (upper right); the lower panel compares the two at the 25th epoch.
  • Figure 5: The performance under different hyperparameters $\alpha$ in different metrics. The dashed line represents the performance without denoising training.
  • ...and 5 more figures