An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones
Reishi Yokomori, Katsuro Inoue
TL;DR
This study empirically analyzes the commit logs of code clone pairs across 45 Apache Java repositories to quantify how often clones change, how often those changes are co-changes, and how frequently the co-changes may be inconsistent. Using CCFinderSW for clone detection and git log -L to extract per-snippet histories, the authors introduce a patch-difference approach with a similarity threshold of 0.4 to flag concerning co-changes. Key findings: clone snippets are changed infrequently (typically 2–3 times), about half of clone commits are co-changes, 10–20% of co-changed commits are potentially concerning, and 35–65% of clone pairs are deemed concerning, depending on the repository. The work informs practical clone-management tooling by advocating lightweight, integrated warning features that surface potential inconsistencies during development, and it suggests directions for more nuanced, context-aware analysis.
Abstract
Code clones are code snippets that are identical or similar to other snippets within the same or different files. They are often created through copy-and-paste practices and modified during development and maintenance activities. Since the two snippets of a clone pair may be logically coupled, changes to each snippet are expected to be made simultaneously (co-changed) and consistently. There is extensive research on code clones, including studies of clone co-change; however, detailed analysis of the commit logs of code clone pairs has been limited. In this paper, we investigate the commit logs of code snippets from clone pairs, using the git log command to extract changes to cloned code snippets. We analyzed 45 repositories owned by the Apache Software Foundation on GitHub and addressed three research questions regarding commit frequency, co-change ratio, and commit patterns. Our findings indicate that (1) clone snippets are changed infrequently, typically only two or three times throughout their lifetime; (2) co-changes account for about half of all clone changes, and 10–20% of co-changed commits are concerning (potentially inconsistent); and (3) 35–65% of all clone pairs are classified as concerning (potentially inconsistent) clone pairs. These results suggest the need for a system that manages clones consistently along their commit timeline.
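As a rough illustration of the extraction step, the following Python sketch collects the commits that touched a line range with git log -L and measures how many of them are shared by the two snippets of a clone pair. It is not the authors' tooling: the function names, the `commit:` marker, and the definition of the ratio (commits touching both snippets divided by commits touching either) are assumptions made for this example, and the patch-difference check with the 0.4 threshold is not reproduced here.

```python
import subprocess

# Hypothetical marker so commit lines can be told apart from the patch text
# that `git log -L` prints alongside each commit.
MARKER = "commit:"


def parse_commit_hashes(log_output: str) -> set[str]:
    """Return the commit hashes found on marker lines of the log output."""
    return {
        line[len(MARKER):].strip()
        for line in log_output.splitlines()
        if line.startswith(MARKER)
    }


def snippet_commits(repo: str, path: str, start: int, end: int) -> set[str]:
    """Commits that changed lines start..end of path, traced by `git log -L`."""
    result = subprocess.run(
        ["git", "-C", repo, "log", f"--format={MARKER}%H", f"-L{start},{end}:{path}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_commit_hashes(result.stdout)


def co_change_ratio(commits_a: set[str], commits_b: set[str]) -> float:
    """Share of commits touching both snippets among commits touching either.

    This definition is an assumption for illustration only.
    """
    changed = commits_a | commits_b
    return len(commits_a & commits_b) / len(changed) if changed else 0.0
```

For a clone pair, one would call `snippet_commits` once per snippet, using the file path and line range reported by the clone detector, and pass the two resulting sets to `co_change_ratio`.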
