An Empirical Study of Token-based Micro Commits
Masanari Kondo, Daniel M. German, Yasutaka Kamei, Naoyasu Ubayashi, Osamu Mizuno
TL;DR
This study defines micro commits as changes that add or remove at most five tokens, which allows very small code changes to be analyzed at the token level rather than the line level. Using cregit/srcML to produce token-based diffs, the authors analyze four large OSS projects (written in Java and C) to quantify how often micro commits occur, which token types they affect, and which change patterns are typical. Micro commits constitute 7.45%–17.95% of commits, most commonly replace a single name or literal token, and are often related to bug fixes. Roughly 90% of one-line commits are micro commits, but only 40–50% of micro commits are one-line commits, and about 30–40% of micro commits span multiple hunks; these observations are not visible with line-based analysis. The work highlights the value of token-level metrics for program repair and quality assurance, provides a replication package, and discusses implications for extending complexity metrics and for future semantics-based definitions of micro commits.
Abstract
In software development, developers frequently perform maintenance activities that change only a few lines of source code in a single commit. A good understanding of the characteristics of such small changes can support quality assurance approaches (e.g., automated program repair): small changes are likely to address deficiencies in other changes, so understanding why they are created can help identify the types of errors that were introduced. Ultimately, these reasons and error types can be used to enhance quality assurance approaches and improve code quality. Prior studies used code churn to characterize and investigate small changes, but such a definition has a critical limitation: it loses the information about which tokens changed within a line. For example, it fails to distinguish the following two one-line changes: (1) changing a string literal to fix a displayed message and (2) changing a function call and adding a new parameter. Both are maintenance activities, but we expect researchers and practitioners to be more interested in supporting the latter. To address this limitation, in this paper we define micro commits, a type of small change based on changed tokens. Our goal is to quantify small changes using changed tokens, which allow us to identify small changes more precisely; in particular, this token-level definition can distinguish the two changes in the example above. We investigate micro commits in four OSS projects to understand their characteristics, in the first empirical study on token-based micro commits. We find that micro commits mainly replace a single name or literal token and that micro commits are more likely to be used to fix bugs. Additionally, we propose the use of token-based information to support software engineering approaches whose effectiveness is significantly affected by very small changes.
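To make the contrast between line-based and token-based views concrete, the following minimal Python sketch compares the two one-line changes from the example. It is an illustration under stated assumptions, not the authors' pipeline: the study obtains token-based diffs with cregit/srcML, whereas this sketch uses a hypothetical regex tokenizer (`TOKEN_RE`) and Python's `difflib`. The helper names (`tokenize`, `token_diff`, `is_micro_commit`) and the sample code lines are invented for illustration, and the five-token threshold is read here as "at most five tokens added and at most five tokens removed", which is one possible interpretation of the definition.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, simplified tokenizer (an assumption, not srcML):
# a double-quoted string literal, a run of word characters, or one symbol.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|\w+|[^\w\s]')

MAX_TOKENS = 5  # threshold taken from the micro-commit definition


def tokenize(code):
    """Split a code fragment into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def token_diff(old, new):
    """Return (removed_tokens, added_tokens) between two code fragments."""
    old_toks, new_toks = tokenize(old), tokenize(new)
    removed, added = [], []
    matcher = SequenceMatcher(a=old_toks, b=new_toks, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old_toks[i1:i2])
            added.extend(new_toks[j1:j2])
    return removed, added


def is_micro_commit(old, new, limit=MAX_TOKENS):
    """Assumed reading: at most `limit` tokens added and at most `limit` removed."""
    removed, added = token_diff(old, new)
    return len(removed) <= limit and len(added) <= limit


# (1) Fixing a displayed message: one literal token is replaced.
msg_old = 'print("Eror: file missing");'
msg_new = 'print("Error: file missing");'

# (2) Changing a function call and adding a new parameter.
call_old = 'result = compute(x);'
call_new = 'result = evaluate(x, mode);'

if __name__ == "__main__":
    for old, new in [(msg_old, msg_new), (call_old, call_new)]:
        removed, added = token_diff(old, new)
        print(removed, added, is_micro_commit(old, new))
```

A line-based diff reports both changes identically (one line removed, one line added). The token-level view separates them: the first replaces a single string literal, while the second replaces a name token and adds a separator and a new argument. Both fall under the five-token threshold in this sketch, so the distinction comes from which tokens changed, not only from how many.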
