Using Changeset Descriptions as a Data Source to Assist Feature Location
Muslim Chochlov, Michael English, Jim Buckley
TL;DR
This paper introduces ACIR, a feature-location technique that uses changeset descriptions from version control as the lexical basis for annotating software artifacts. ACIR partitions code into artifacts (at file or method level), attaches to each artifact the descriptions of the changesets that touched it (either the most recent one or all of them), and indexes the resulting information retrieval (IR) corpus with Lucene under a Vector Space Model. An empirical study on Rhino and Mylyn.Tasks assesses efficiency, granularity effects, and the influence of changeset range by reenacting change requests and applying standard IR metrics (MAP, MRR, effectiveness). Findings show that ACIR is competitive with existing text-based feature-location techniques (FLTs), that method-level granularity reduces developer effort by up to 64%, and that the impact of changeset range depends on the project’s history. The work highlights the potential of changeset descriptions as a viable data source for feature location and outlines directions for scaling, evolution-aware changeset selection, and integration with other feature-location approaches.
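The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps Lucene for a hand-rolled TF-IDF cosine ranking, and the toy history, the whitespace tokenizer, and the names `CHANGESETS`, `build_corpus`, and `rank` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy history, oldest first: (changeset description, artifacts touched).
CHANGESETS = [
    ("fix parser crash on empty regexp literal", ["Parser.java", "RegExp.java"]),
    ("add debugger breakpoint support", ["Debugger.java"]),
    ("improve regexp matching speed", ["RegExp.java"]),
]

def tokenize(text):
    # Naive tokenizer (assumption); a real setup would also stem and drop stop words.
    return text.lower().split()

def build_corpus(changesets, mode="all"):
    """Describe each artifact by the changeset descriptions that touched it.

    mode="all" concatenates every description; mode="recent" keeps only the latest.
    """
    docs = defaultdict(list)
    for description, artifacts in changesets:
        for artifact in artifacts:
            if mode == "recent":
                docs[artifact] = tokenize(description)  # later changesets overwrite
            else:
                docs[artifact].extend(tokenize(description))
    return dict(docs)

def weigh(tokens, idf):
    # TF-IDF weights; terms unseen in the corpus get weight 0.
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return artifact names ordered by cosine similarity to the query."""
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(len(docs) / d) + 1.0 for t, d in df.items()}
    q = weigh(tokenize(query), idf)
    scored = [(cosine(q, weigh(tokens, idf)), name) for name, tokens in docs.items()]
    return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]

if __name__ == "__main__":
    print(rank("regexp matching", build_corpus(CHANGESETS, mode="recent")))
    print(rank("debugger breakpoint", build_corpus(CHANGESETS, mode="all")))
```

The artifacts here are file names; under the same assumptions, method-level granularity would only change the keys of the corpus (for example, one entry per method touched by a changeset).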
Abstract
Feature location attempts to assist developers in discovering functionality in source code. Many textual feature location techniques utilize information retrieval and rely on the comments and identifiers in source code to describe software entities. An interesting alternative would be to use changeset descriptions as the data source, describing each software entity by the descriptions of the changesets that altered it. To investigate this, we implement a technique utilizing changeset descriptions and conduct an empirical study to observe its overall performance. Moreover, we study how granularity (i.e., file-level or method-level software entities) and changeset range inclusion (i.e., the most recent changeset or all historical changesets) affect such an approach. The results of a preliminary study with the Rhino and Mylyn.Tasks systems suggest that the approach could lead to a potentially efficient feature location technique. They also suggest that, in terms of developer effort, it is advantageous to configure the technique at method-level granularity, and that older changesets from older systems may reduce the effectiveness of the technique.
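Performance and effort of this kind are usually quantified with rank-based measures; the TL;DR names MAP, MRR, and effectiveness. The sketch below shows the standard definitions of reciprocal rank and average precision, and it assumes that effectiveness means the rank of the first relevant artifact, as is common in feature-location studies. The function names and the example ranking are hypothetical.

```python
def first_relevant_rank(ranking, relevant):
    """1-based position of the first relevant artifact, or None if none is retrieved.

    Assumed here to correspond to the 'effectiveness' measure: lower is better,
    since it approximates how many results a developer inspects.
    """
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            return position
    return None

def reciprocal_rank(ranking, relevant):
    position = first_relevant_rank(ranking, relevant)
    return 0.0 if position is None else 1.0 / position

def average_precision(ranking, relevant):
    """Mean of the precision values taken at each relevant artifact's position."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

if __name__ == "__main__":
    # One (ranking, gold set) pair per reenacted change request; values are made up.
    queries = [
        (["a", "b", "c", "d"], {"b", "d"}),
        (["x", "y", "z"], {"x"}),
    ]
    print("MRR:", mean(reciprocal_rank(r, g) for r, g in queries))
    print("MAP:", mean(average_precision(r, g) for r, g in queries))
```

MRR and MAP are then the means of these per-query values over all change requests in the study.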
