An Empirical Study of Token-based Micro Commits

Masanari Kondo, Daniel M. German, Yasutaka Kamei, Naoyasu Ubayashi, Osamu Mizuno

TL;DR

This study defines micro commits as changes that add or remove at most five tokens, enabling token-level analysis of very small code changes to overcome line-focused limitations. Using cregit/srcML to produce token-based diffs, the authors analyze four large OSS projects (Java and C) to quantify frequency, token-types affected, and typical change patterns, finding micro commits constitute 7.45%–17.95% of commits and commonly replace a single name or literal token, with many fixes related to bugs. They show that roughly 90% of one-line commits are micro commits, but only 40–50% of micro commits are one-line commits, and about 30–40% of micro commits span multiple hunks, revealing insights not visible with line-based analysis. The work highlights the value of token-level metrics for program repair and QA, provides a replication package, and discusses implications for extending complexity metrics and future semantic-based definitions of micro commits.

Abstract

In software development, developers frequently perform maintenance activities that change only a few lines of source code in a single commit. A good understanding of the characteristics of such small changes can support quality assurance approaches (e.g., automated program repair), as it is likely that small changes address deficiencies in other changes; thus, understanding the reasons for creating small changes can help in understanding the types of errors introduced. Eventually, these reasons and error types can be used to enhance quality assurance approaches for improving code quality. While prior studies used code churn to characterize and investigate small changes, such a definition has a critical limitation: it loses the information about which tokens changed within a line. For example, this definition fails to distinguish the following two one-line changes: (1) changing a string literal to fix a displayed message and (2) changing a function call and adding a new parameter. Both are maintenance activities, but we expect that researchers and practitioners are more interested in supporting the latter change. To address this limitation, in this paper, we define micro commits, a type of small change based on changed tokens. Our goal is to quantify small changes using changed tokens, which allow us to identify small changes more precisely. In fact, this token-level definition can distinguish the two changes in the above example. We investigate micro commits in four OSS projects and characterize them in the first empirical study on token-based micro commits. We find that micro commits mainly replace a single name or literal token, and that micro commits are more likely to be used to fix bugs. Additionally, we propose the use of token-based information to support software engineering approaches in which very small changes significantly affect their effectiveness.
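
To make the token-level definition concrete, the following is a minimal sketch of how the two one-line changes from the abstract can be told apart by counting changed tokens. It is not the authors' pipeline: the study uses cregit/srcML to produce token-based diffs, whereas this sketch substitutes a naive regex tokenizer and Python's `difflib`. The names `tokenize`, `token_churn`, and `is_micro_commit` are hypothetical, and the threshold check (at most five tokens added and at most five removed) is one reading of the definition given in the TL;DR.

```python
import re
from difflib import SequenceMatcher

# Naive tokenizer (an assumption, standing in for cregit/srcML):
# double-quoted string literals, identifiers, numbers, then any single symbol.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+(?:\.\d+)?|\S')


def tokenize(code):
    """Split a code snippet into a flat list of token strings."""
    return TOKEN_RE.findall(code)


def token_churn(old, new):
    """Return (added, removed) token counts between two snippets."""
    a, b = tokenize(old), tokenize(new)
    added = removed = 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # tokens present only in the old version
        if tag in ("replace", "insert"):
            added += j2 - j1    # tokens present only in the new version
    return added, removed


def is_micro_commit(old, new, max_tokens=5):
    """True if the change adds and removes at most max_tokens tokens each."""
    added, removed = token_churn(old, new)
    return (added + removed) > 0 and max(added, removed) <= max_tokens


# (1) Fixing a displayed message: one literal token is replaced.
message_fix = token_churn(r'puts("Eror: file not found");',
                          r'puts("Error: file not found");')

# (2) Changing a function call and adding a parameter.
call_change = token_churn("send(buf, len);", "send_all(buf, len, flags);")

print(message_fix, call_change)
```

Both changes modify exactly one line, so line-based churn treats them identically, while the token counts differ. A real analysis would need a language-aware tokenizer, since this regex ignores comments, character literals, and multi-character operators.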

Paper Structure (30 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Proportions of changed token types ($>5\%$)
  • Figure 2: Numbers of changed tokens
  • Figure 3: Proportion of one-line commits by the number of tokens added or removed. The x- and y-axes show the numbers of added and deleted tokens, and each cell indicates the proportion of commits.
  • Figure 4: Accumulated distribution of one-line commits in terms of the maximum number of added or removed tokens
  • Figure 5: Accumulated distribution of micro commits ($N=5$) in terms of the number of hunks included
  • ...and 1 more figures