Repository-level Code Search with Neural Retrieval Methods
Siddharth Gandhi, Luyu Gao, Jamie Callan
TL;DR
This work defines repository-level code search as retrieving the files in the current repository snapshot that are most relevant to a given user query or bug, and proposes a multi-stage pipeline: BM25 retrieval over past commit messages, followed by neural reranking with CommitReranker and CodeReranker. By training across diverse repositories and leveraging their commit histories, the approach learns cross-project patterns that help surface relevant files and aid LLM-based bug fixing. Evaluation on a new dataset built from seven popular open-source repositories shows substantial gains (up to roughly 80% in MAP, MRR, and P@1) over the BM25 baseline, with CodeReranker delivering the strongest gains and cross-repository training improving generalization. The findings suggest a viable way to supply high-quality, concise context to large language models, enabling more accurate code understanding and bug resolution in real-world software development.
Abstract
This paper presents a multi-stage reranking system for repository-level code search, which leverages the widely available commit histories of large open-source repositories to aid in bug fixing. We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. By learning patterns from diverse repositories and their commit histories, the system can surface relevant files for the task at hand. The system leverages both commit messages and source code for relevance matching, and is evaluated in both normal and oracle settings. Experiments on a new dataset created from 7 popular open-source repositories show substantial improvements of up to 80% in MAP, MRR, and P@1 over the BM25 baseline across a diverse set of queries, demonstrating the effectiveness of this approach. We hope this work aids LLM agents as a tool for better code search and understanding. Our code and results are publicly available.
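
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MAP, MRR, and P@1 are conventionally computed for file-level retrieval. It is not taken from the authors' code. The function names, the file paths, and the choice to normalize average precision by the total number of relevant files are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant file.

    Assumption: normalized by the total number of relevant files, so
    relevant files missing from the ranking count against the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, path in enumerate(ranked, start=1):
        if path in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file, or 0 if none is retrieved."""
    for k, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / k
    return 0.0


def precision_at_1(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 if the top-ranked file is relevant, else 0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Average each metric over queries; `runs` holds (ranking, gold files) pairs."""
    n = len(runs)
    return {
        "MAP": sum(average_precision(r, g) for r, g in runs) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in runs) / n,
        "P@1": sum(precision_at_1(r, g) for r, g in runs) / n,
    }


# Hypothetical rankings for two queries (file names are made up).
runs = [
    (["a.py", "b.py", "c.py"], {"a.py", "c.py"}),
    (["x.py", "y.py"], {"y.py"}),
]
scores = evaluate(runs)
print(scores)
```

In this toy example the first query places relevant files at ranks 1 and 3 and the second at rank 2, so P@1 is 0.5 and MRR is 0.75. A reranker improves these numbers by moving the files that were actually modified to fix the bug closer to the top of the list.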
