Enhancing Software Maintenance: A Learning to Rank Approach for Co-changed Method Identification

Yiping Jia; Safwat Hassan; Ying Zou

TL;DR

The paper addresses the identification of co-change relationships by ranking co-changed methods at the pull-request level. It introduces a learning-to-rank (LtR) approach that combines history-derived signals with static code features and evaluates it on 150 open-source Java projects. Random Forest is the strongest model, outperforming baselines by up to 537.5% in NDCG@5. Because accuracy declines about 60 days after training, the model needs retraining every two months to sustain performance. Ten features (historical, semantic, structural, and clone signals) are collected and refined via correlation analysis, and permutation importance shows that the frequency of past co-changes is the dominant predictor. The approach performs robustly across varying project characteristics, and the paper gives practical guidance on the length of the training history window and the retraining cadence for real-world deployment. Overall, the work supports software maintenance by enabling more accurate, scalable identification of co-changed methods, which helps teams manage dependencies and maintain quality across evolving codebases.
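To make the dominant signal concrete, the following is a minimal sketch of ranking candidate methods purely by how often they were changed in the same pull request as a target method. It is not the authors' model, which learns a ranking from ten features. The function names `co_change_counts` and `rank_candidates`, the method names, and the toy history are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(pull_requests):
    """Count how often each unordered pair of methods changed in the same PR."""
    counts = Counter()
    for methods in pull_requests:
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(methods)), 2):
            counts[(a, b)] += 1
    return counts


def rank_candidates(target, counts, k=5):
    """Return the top-k methods by past co-change frequency with `target`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == target:
            scores[b] += n
        elif b == target:
            scores[a] += n
    # Descending frequency, then name, so ties are ordered deterministically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical history: each inner list is the set of methods touched by one PR.
history = [
    ["A.save", "A.validate", "B.log"],
    ["A.save", "A.validate"],
    ["A.save", "C.render"],
]
ranking = rank_candidates("A.save", co_change_counts(history))
print(ranking)
```

A frequency-only ranking like this ignores the semantic, structural, and clone signals the paper also uses, so it is best read as a single feature rather than a full ranker.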

Abstract

With the increasing complexity of large-scale software systems, identifying all necessary modifications for a specific change is challenging. Co-changed methods, which are methods frequently modified together, are crucial for understanding software dependencies. However, existing approaches often produce large result sets with many false positives. Focusing on pull requests instead of individual commits provides a more comprehensive view of related changes, capturing essential co-change relationships. To address these challenges, we propose a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. Experiments on 150 open-source Java projects, totaling 41.5 million lines of code and 634,216 pull requests, show that the Random Forest model outperforms other models by 2.5 to 12.8 percent in NDCG@5. It also surpasses baselines such as file proximity, code clones, FCP2Vec, and StarCoder 2 by 4.7 to 537.5 percent. Models trained on longer historical data (90 to 180 days) perform consistently, while accuracy declines after 60 days, highlighting the need to retrain every two months. This approach provides an effective tool for managing co-changed methods, enabling development teams to handle dependencies and maintain software quality.
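Since the results are reported in NDCG@5, a short sketch of that metric may help. It assumes the common linear-gain form with a log2 position discount; the paper may use a different gain variant, although the two coincide for binary relevance (1 if a ranked method was actually co-changed, 0 otherwise). It also assumes the relevance list covers all candidates, so the ideal ordering can be derived from it.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each ranked candidate, in ranked order (hypothetical values).
perfect = ndcg_at_k([1, 1, 0, 0, 0])   # relevant items ranked first
delayed = ndcg_at_k([0, 1, 0, 0, 0])   # the only relevant item is ranked second
print(perfect, delayed)
```

A ranking that places every co-changed method at the top scores 1.0, and the score drops as relevant methods slide down the list, which is why the metric suits a setting where developers inspect only the first few suggestions.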
