Repository-level Code Search with Neural Retrieval Methods

Siddharth Gandhi, Luyu Gao, Jamie Callan

TL;DR

This work defines repository-level code search as retrieving the files in the current repository snapshot that address a given user query or bug. It proposes a multi-stage pipeline that first runs BM25 over past commit messages and then refines the ranking with two neural rerankers, CommitReranker and CodeReranker. By training across diverse repositories and leveraging their commit histories, the approach learns cross-project patterns that help surface relevant files and aid LLM-based bug fixing. Empirical evaluation on a new dataset built from seven popular OSS repositories shows substantial gains over BM25 (up to ~80% on MAP, MRR, and P@1), with CodeReranker delivering the strongest gains and cross-repository training improving generalization. The findings suggest a viable way to supply high-quality, concise context to large language models, enabling more accurate code understanding and bug resolution in real-world software development.
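The first stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the BM25 parameters, the max-score aggregation from commits to files, and the `rerank` callable (a stand-in for a trained neural reranker) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_files(query, commits, rerank=None, depth=100):
    """Rank files for a query.

    commits: list of (commit_message, [files touched by that commit]).
    rerank:  hypothetical callable (query, file) -> score, standing in
             for a neural reranker; applied to the top `depth` files.
    """
    scores = bm25_scores(query, [message for message, _ in commits])
    # Assumption: a file inherits the best score of any commit touching it.
    file_scores = defaultdict(float)
    for score, (_, files) in zip(scores, commits):
        for f in files:
            file_scores[f] = max(file_scores[f], score)
    ranked = sorted(file_scores, key=lambda f: (-file_scores[f], f))
    if rerank is not None:
        head = sorted(ranked[:depth], key=lambda f: -rerank(query, f))
        ranked = head + ranked[depth:]
    return ranked
```

Calling `rank_files(query, commits)` returns files ordered by the lexical stage alone; passing a `rerank` function reorders only the top `depth` candidates, which mirrors the retrieve-then-rerank structure of the pipeline.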

Abstract

This paper presents a multi-stage reranking system for repository-level code search, which leverages the widely available commit histories of large open-source repositories to aid in bug fixing. We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. By learning patterns from diverse repositories and their commit histories, the system can surface relevant files for the task at hand. The system leverages both commit messages and source code for relevance matching, and is evaluated in both normal and oracle settings. Experiments on a new dataset created from 7 popular open-source repositories show substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline across a diverse set of queries, demonstrating the effectiveness of this approach. We hope this work aids LLM agents as a tool for better code search and understanding. Our code and results are publicly available.
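For readers unfamiliar with the reported metrics, the following sketch shows how MAP, MRR and P@1 are conventionally computed from ranked file lists. It uses the standard textbook definitions; the function names and the input format are assumptions for illustration, not taken from the authors' evaluation code.

```python
def precision_at_1(ranked, relevant):
    """1.0 if the top-ranked file is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant file, or 0.0 if none is retrieved."""
    for rank, f in enumerate(ranked, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each relevant file."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, f in enumerate(ranked, start=1):
        if f in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate(runs):
    """runs: list of (ranked_files, set_of_relevant_files), one per query."""
    n = len(runs)
    return {
        "MAP": sum(average_precision(r, g) for r, g in runs) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in runs) / n,
        "P@1": sum(precision_at_1(r, g) for r, g in runs) / n,
    }
```

Each query contributes one ranked list and one set of relevant files; the three metrics are then averaged over all queries.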

Paper Structure

This paper contains 46 sections, 5 figures, 10 tables.

Figures (5)

  • Figure 1: An overview of our system
  • Figure 2: Histogram of Number of files per FID across 7 repositories
  • Figure 3: Visualization of Table \ref{tab:combined_results}. Bold lines are averages across the 7 repositories; light background lines show the individual performance of each constituent repository. Performance does not improve with increasing reranking depth. CodeReranker@100 performs almost as well as the Full Pipeline, probably because the intermediate CommitReranker@1000 yields only a minor increase in R@100.
  • Figure 4: Visualization of Table \ref{tab:fbr_all}a. CodeReranker (blue) is significantly better than CommitReranker (red) in all settings; however, the Full Pipeline is still, surprisingly, better.
  • Figure 5: Visualization of Table \ref{tab:fbr_all}b. Notice how much both the red and blue lines lift the grey curve. CodeReranker (blue) improves the same pre-ranking (grey) significantly more than CommitReranker (red).