RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion
Huy N. Phan, Hoang N. Phan, Tien N. Nguyen, Nghi D. Q. Bui
TL;DR
RepoHyper tackles repository-level code completion by constructing a Repo-level Semantic Graph (RSG) that captures global project context, then applying a Search-then-Expand retrieval strategy followed by link-prediction-based re-ranking. Together, these three components enable selective retrieval of program-semantic contexts, improving both context retrieval and end-to-end code completion over traditional similarity-based methods. Empirical evaluation on RepoBench shows substantial gains in retrieval accuracy (relative improvements of up to roughly 49%) and in code completion metrics (EM and CodeBLEU), with ablations underscoring the importance of pattern-driven expansion and link-prediction re-ranking. The approach offers a scalable, language-adaptive framework for bringing repository-wide semantics into code assistants, with practical value for developers and tooling that work with large codebases.
Abstract
Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages an Expand and Refine retrieval method, consisting of a graph expansion and a link prediction algorithm applied to the RSG, which enables the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, achieving higher accuracy across various datasets than several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.
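To make the search, expand, and refine stages concrete, the following is a minimal sketch on a toy graph, not the RepoHyper implementation. All names here (`search`, `expand`, `refine`, the node labels, and the example query) are hypothetical. Two simplifying assumptions are worth stating plainly: token-level Jaccard similarity stands in for the retriever used in the search stage, and the same similarity stands in for the learned link predictor in the refine stage.

```python
from collections import deque


def tokens(text):
    """Lowercased whitespace tokens of a snippet, as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-level Jaccard similarity (assumed stand-in for a real retriever)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def search(nodes, query, top_k):
    """Search stage: pick the top_k nodes most similar to the query."""
    ranked = sorted(nodes, key=lambda n: (-jaccard(nodes[n], query), n))
    return ranked[:top_k]


def expand(edges, seeds, max_hops):
    """Expand stage: breadth-first walk from the seeds, up to max_hops edges."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def refine(nodes, candidates, query, score_fn, top_k):
    """Refine stage: re-rank candidates; score_fn is an assumed placeholder
    for a trained link-prediction model."""
    ranked = sorted(candidates, key=lambda n: (-score_fn(nodes[n], query), n))
    return ranked[:top_k]


# Toy graph: node -> code text, node -> outgoing edges (all hypothetical).
nodes = {
    "utils.load_config": "def load_config path read yaml file",
    "models.Trainer": "class Trainer fit model with config",
    "models.Trainer.fit": "def fit self data run training loop",
    "data.Dataset": "class Dataset load samples from disk",
}
edges = {
    "models.Trainer": ["models.Trainer.fit", "utils.load_config"],
    "models.Trainer.fit": ["data.Dataset"],
}
query = "trainer = Trainer config"

seeds = search(nodes, query, top_k=1)
one_hop = expand(edges, seeds, max_hops=1)
two_hop = expand(edges, seeds, max_hops=2)
best = refine(nodes, two_hop, query, jaccard, top_k=1)
```

In this sketch, expansion reaches nodes such as `data.Dataset` that share no tokens with the query and would be missed by similarity search alone; the refine step then decides which of the expanded candidates to keep.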
