RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion

Huy N. Phan, Hoang N. Phan, Tien N. Nguyen, Nghi D. Q. Bui

TL;DR

RepoHyper tackles repository-level code completion by constructing a Repo-level Semantic Graph (RSG) to capture global project context and applying a Search-then-Expand retrieval strategy followed by Link Prediction-based re-ranking. Together, these three components enable selective retrieval of program-semantic contexts, improving both context retrieval and end-to-end code completion over traditional similarity-based methods. Empirical evaluation on RepoBench shows substantial gains in retrieval accuracy (relative improvements of up to about 49%) and in code completion metrics (EM and CodeBLEU), with ablations underscoring the importance of pattern-driven expansion and link-prediction re-ranking. The approach offers a scalable, language-adaptive framework for leveraging repository-wide semantics in code assistants, with practical impact for developers and tooling around large codebases.

Abstract

Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages an Expand and Refine retrieval method, comprising graph expansion and a link prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.
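
The search, expand, and refine stages described above can be illustrated with a minimal sketch. This is not the RepoHyper implementation: the toy graph, the node names, the two-dimensional embeddings, and the helpers `search`, `expand`, and `refine` are all hypothetical, and plain cosine similarity stands in for the learned link predictor that the paper applies to the RSG.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, embeddings, k):
    """Search: pick the k nodes most similar to the query."""
    ranked = sorted(embeddings,
                    key=lambda n: cosine(query_vec, embeddings[n]),
                    reverse=True)
    return ranked[:k]


def expand(seeds, edges, max_hops):
    """Expand: breadth-first walk from the seeds, at most max_hops edges away."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nbr in edges.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, depth + 1))
    return seen


def refine(query_vec, candidates, embeddings, score_fn, top_n):
    """Refine: re-rank candidates with score_fn (a stand-in for link prediction)."""
    ranked = sorted(candidates,
                    key=lambda n: score_fn(query_vec, embeddings[n]),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical toy graph: three code nodes with made-up 2-D embeddings.
embeddings = {
    "ModelA": [1.0, 0.0],
    "helper": [0.0, 1.0],
    "unrelated": [-1.0, 0.0],
}
edges = {"ModelA": ["helper"], "helper": [], "unrelated": []}
query = [0.9, 0.1]

seeds = search(query, embeddings, k=1)
candidates = expand(seeds, edges, max_hops=1)
contexts = refine(query, candidates, embeddings, cosine, top_n=2)
```

In this toy setup, `helper` is dissimilar to the query yet still reaches the candidate set because it is linked to the seed node, which is the behaviour a purely similarity-based search would miss.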

Paper Structure (33 sections, 8 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of graph-based semantic search versus similarity-based search. The orange block indicates the ground-truth line to be completed, which calls the function get_similarity_metric. Similarity-based methods mistakenly focus on the MultiLabelInversionModel class because its form resembles the current in-file context, leading to incorrect completions. Conversely, RepoHyper successfully identifies the correct context by first identifying the most similar code snippet in the codebase and then expanding and linking.
  • Figure 2: Overall Architecture of RepoHyper. Here we use $K=1$.
  • Figure 3: Retrieval performance comparison between RepoHyper and Similarity-based Semantic Search across different context types. Similarity-based Semantic Search denotes kNN search within our RSG using the UniXCoder encoder; RepoHyper uses the same encoder. See the appendix on Context Types for more details.
  • Figure 4: Sample ID 1430 in repository secdev/scapy