GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, Qianxiang Wang

TL;DR

This work tackles repository-level code completion, where general LLMs lack repository-specific knowledge. It introduces GraphCoder, which uses a Code Context Graph (CCG) to capture structured, cross-statement context and employs a coarse-to-fine retrieval pipeline that re-ranks candidate snippets with a decay-with-distance subgraph edit distance. In experiments on RepoEval-Updated covering Python and Java with multiple LLMs, GraphCoder achieves higher exact-match scores for code and identifiers while reducing retrieval time and database size. The results generalize across model sizes and repositories, and ablations confirm that both the control-flow graph (CFG) component and the coarse-to-fine design are critical to the gains.

Abstract

The performance of repository-level code completion depends on effectively leveraging both general and repository-specific knowledge. Despite their impressive capability in general code completion tasks, code LLMs often perform less satisfactorily on repository-level completion because they lack repository-specific knowledge. To address this problem, we propose GraphCoder, a retrieval-augmented code completion framework that combines LLMs' general code knowledge with repository-specific knowledge via a graph-based retrieval-generation process. In particular, GraphCoder captures the context of the completion target more accurately through a code context graph (CCG) that consists of control-flow, data-dependence, and control-dependence relations between code statements, which is a more structured representation of the completion target's context than the sequence-based context used in existing retrieval-augmented approaches. Based on the CCG, GraphCoder further employs a coarse-to-fine retrieval process to locate code snippets in the current repository whose context is similar to that of the completion target. Experimental results demonstrate both the effectiveness and efficiency of GraphCoder: compared to baseline retrieval-augmented methods, GraphCoder achieves higher exact match (EM) on average, with increases of +6.06 in code match and +6.23 in identifier match, while using less time and space.
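
The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not GraphCoder's implementation: the abstract only names the two stages, so everything below is an assumption made for illustration. The names `Snippet`, `coarse_score`, `fine_score`, and `retrieve` are hypothetical, the coarse stage uses plain Jaccard overlap of tokens, and the fine stage replaces the paper's decay-with-distance subgraph edit distance with a much simpler per-statement similarity whose weight decays with distance from the completion target. Context is also modeled as a flat list of statements rather than a CCG slice.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Snippet:
    """Hypothetical candidate: context statements, nearest to the target first."""
    name: str
    context: list[str]


def tokens(statement: str) -> set[str]:
    # Crude whitespace tokenizer; a real system would use a language parser.
    return set(statement.replace("(", " ").replace(")", " ").split())


def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def coarse_score(query: list[str], cand: Snippet) -> float:
    # Coarse stage (assumed): order-insensitive bag-of-tokens overlap.
    q = set().union(*(tokens(s) for s in query))
    c = set().union(*(tokens(s) for s in cand.context))
    return jaccard(q, c)


def fine_score(query: list[str], cand: Snippet, decay: float = 0.5) -> float:
    # Fine stage (assumed stand-in): compare statements position by position,
    # weighting each by decay**distance so nearby context matters most.
    total, norm = 0.0, 0.0
    for dist, q_stmt in enumerate(query):
        weight = decay ** dist
        norm += weight
        if dist < len(cand.context):
            total += weight * jaccard(tokens(q_stmt), tokens(cand.context[dist]))
    return total / norm if norm else 0.0


def retrieve(query: list[str], database: list[Snippet],
             coarse_k: int = 2, top_k: int = 1) -> list[Snippet]:
    # Cheap filter first, then re-rank only the shortlist with the finer score.
    shortlist = sorted(database, key=lambda s: coarse_score(query, s),
                       reverse=True)[:coarse_k]
    return sorted(shortlist, key=lambda s: fine_score(query, s),
                  reverse=True)[:top_k]


if __name__ == "__main__":
    query = ["total += price", "for price in prices", "total = 0"]
    database = [
        Snippet("A", ["total += price", "for price in prices", "total = 0"]),
        Snippet("B", ["print(msg)", "msg = read(path)"]),
        Snippet("C", ["total = 0", "for price in prices", "count += price"]),
    ]
    for hit in retrieve(query, database):
        print(hit.name, round(fine_score(query, hit), 3))
```

In this toy setup, snippets A and C share almost the same tokens with the query, so the coarse stage alone barely separates them; the position-aware, distance-decayed fine stage is what distinguishes a candidate whose nearest statements match from one that merely contains similar tokens.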

Paper Structure (36 sections, 1 equation, 10 figures, 5 tables, 2 algorithms)

Figures (10)

  • Figure 1: An example of the code context graph (CCG) and its CCG slice with statement of interest $\tilde{x}=13$.
  • Figure 2: An illustration of GraphCoder framework.
  • Figure 3: Prompt template used in GraphCoder.
  • Figure 4: Venn diagram of the completion results of different methods on the GPT3.5-Turbo-Instruct model, showing the number of tasks each method completes correctly.
  • Figure 5: A qualitative example demonstrating the effectiveness of GraphCoder.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Definition 1: Code Context Graph
  • Definition 2: CCG Slice