Back to the Basics: Rethinking Issue-Commit Linking with LLM-Assisted Retrieval

Huihui Huang, Ratnadira Widyasari, Ting Zhang, Ivana Clairine Irsan, Jieke Shi, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo

TL;DR

This work tackles issue-commit linking by exposing realism gaps in prior evaluations and introducing the Realistic Distribution Setting (RDS) to build a large, practical benchmark across 20 open-source projects. It shows that, under realistic conditions, a traditional IR baseline (VSM) outperforms the state-of-the-art deep learning approach, and that combining vector-database retrieval with a lightweight LLM reranking stage (EasyLink) yields substantial gains, reaching an average Precision@1 of 75.03%. EasyLink outperforms previous methods by large margins on realistic data, with the LLM component providing semantic refinement and the vector-database stage providing scalable retrieval. The paper offers practical guidelines and an accessible replication package, advancing research in software traceability and issue-commit link recovery.

Abstract

Issue-commit linking, which connects issues with the commits that fix them, is crucial for software maintenance. Existing approaches have shown promise in automatically recovering these links. Evaluations of these techniques assess their ability to distinguish genuine links from plausible but false ones. However, these evaluations overlook the fact that, in reality, a repository with more commits also contains more plausible yet unrelated commits, which can make it harder for a tool to identify the correct fix commits. To address this, we propose the Realistic Distribution Setting (RDS) and use it to construct a more realistic evaluation dataset covering 20 open-source projects. By evaluating tools on this dataset, we observe that the performance of the state-of-the-art deep learning-based approach drops by more than half, while the traditional Information Retrieval method, VSM, outperforms it. Inspired by these observations, we propose EasyLink, which uses a vector database as a modern Information Retrieval technique. To address the long-standing problem of the semantic gap between issues and commits, EasyLink leverages a large language model to rerank the commits retrieved from the database. Under our evaluation, EasyLink achieves an average Precision@1 of 75.03%, improving over the state-of-the-art by over four times. Additionally, this paper provides practical guidelines for advancing research in issue-commit link recovery.
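
The two-stage idea described in the abstract (retrieve candidate commits, then rerank them) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy bag-of-words cosine similarity stands in for the vector database, and the reranker is a caller-supplied function standing in for the LLM prompt. All names here (`embed`, `retrieve`, `link`, `Reranker`) are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical type: receives the issue text and the candidate commits
# (id -> message) and returns commit ids, best first. In EasyLink this
# role is played by an LLM; here it can be any callable.
Reranker = Callable[[str, Dict[str, str]], List[str]]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase term counts (an assumption, not a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(issue: str, commits: Dict[str, str], k: int) -> List[str]:
    """Stage 1: return the ids of the k commits most similar to the issue."""
    query = embed(issue)
    ranked = sorted(
        commits,
        key=lambda cid: cosine(query, embed(commits[cid])),
        reverse=True,
    )
    return ranked[:k]


def link(issue: str, commits: Dict[str, str], k: int, rerank: Reranker) -> List[str]:
    """Stage 2: let the reranker reorder the top-k candidates."""
    candidates = retrieve(issue, commits, k)
    proposed = rerank(issue, {cid: commits[cid] for cid in candidates})
    # Keep only ids that were actually retrieved (dropping duplicates), then
    # append any candidates the reranker omitted, in retrieval order.
    ordered = [cid for cid in dict.fromkeys(proposed) if cid in candidates]
    ordered += [cid for cid in candidates if cid not in ordered]
    return ordered
```

The guard in `link` is a design choice of this sketch: a generative reranker may return unknown or missing ids, so the output is always a permutation of the retrieved candidates. The parameter `k` controls how many candidates reach the reranker.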

Paper Structure

This paper contains 32 sections, 3 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Example of an issue with the incorrect top-ranked commit and the correct commit. An issue-commit linking approach has to bridge the semantic gap and distinguish the correct commit from similar ones in a potentially large set.
  • Figure 2: Illustration of the evaluation limitation in prior work. $C_1$ and $C_2$ are the fix commits of issue $I_1$. False links will be constructed for $I_1$. $C_3$ and $C_4$ are commits already linked to the issue in the true links dataset. $C_a$, $C_b$, $C_c$, and $C_d$ are commits present in the repository but not in the true links dataset, and they are ensured not to be the fix commit of $I_1$.
  • Figure 3: Overview of EasyLink. EasyLink consists of two key steps: the first uses a vector database to retrieve initial ranked results, and the second prompts an LLM to rerank those results.
  • Figure 4: Effect of varying $k$ in EasyLink during reranking: A higher $k$ slightly raises performance (left axis: P@1, MRR, NDCG@1) while significantly increasing test time cost (right axis).
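
The ranking metrics named in Figure 4 (P@1, MRR, NDCG@1) can be computed per issue roughly as in the sketch below. It assumes binary relevance (a commit either fixes the issue or does not) and standard textbook definitions; the paper's exact formulas are not reproduced in this summary, and the function names are hypothetical.

```python
import math
from typing import List, Set


def precision_at_1(ranked: List[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked commit is a true fix commit, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first true fix commit, or 0.0 if none is ranked."""
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains (assumption: gain 1 for a fix commit, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, cid in enumerate(ranked[:k], start=1)
        if cid in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values: List[float]) -> float:
    """Average per-issue scores, e.g. reciprocal ranks to obtain MRR."""
    return sum(values) / len(values) if values else 0.0
```

Under these binary-relevance assumptions, NDCG@1 coincides with P@1 for any issue that has at least one fix commit, so the two would only differ if graded relevance were used.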