GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model
Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, Qianxiang Wang
TL;DR
This work tackles repository-level code completion, where general-purpose LLMs lack repository-specific knowledge. It introduces GraphCoder, which uses a Code Context Graph (CCG) to capture structured, cross-statement context and applies a coarse-to-fine retrieval pipeline that re-ranks candidate snippets with a decay-with-distance subgraph edit distance. On the RepoEval-Updated benchmark, in experiments covering Python and Java with multiple LLMs, GraphCoder achieves higher exact-match scores for both code and identifiers while reducing retrieval time and database size. The results generalize across model sizes and repositories, and ablations confirm that the control-flow graph (CFG) and the coarse-to-fine design are critical to these gains.
Abstract
The performance of repository-level code completion depends on effectively leveraging both general and repository-specific knowledge. Despite their impressive capability in general code completion tasks, code LLMs often perform less satisfactorily on repository-level completion because they lack repository-specific knowledge. To address this problem, we propose GraphCoder, a retrieval-augmented code completion framework that combines LLMs' general code knowledge with repository-specific knowledge via a graph-based retrieval-generation process. In particular, GraphCoder captures the context of the completion target more accurately through a code context graph (CCG) that consists of control flow, data dependence, and control dependence between code statements. This is a more structured way to capture the completion target's context than the sequence-based context used in existing retrieval-augmented approaches. Based on the CCG, GraphCoder further employs a coarse-to-fine retrieval process to locate, in the current repository, code snippets whose context is similar to that of the completion target. Experimental results demonstrate both the effectiveness and efficiency of GraphCoder: compared to baseline retrieval-augmented methods, GraphCoder achieves higher exact match (EM) on average, with increases of +6.06 in code match and +6.23 in identifier match, while using less time and space.
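
To make the coarse-to-fine idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each context is already reduced to a list of (statement, hops) pairs, where hops is the graph distance from the completion target. The coarse stage uses a bag-of-tokens Jaccard similarity, and the fine stage uses a simplified decay-weighted statement alignment as a stand-in for the decay-with-distance subgraph edit distance. All function names, the `gamma` decay factor, and the shortlist size `k` are hypothetical.

```python
import re


def tokens(stmt):
    """Split a statement into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", stmt))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coarse_score(query, cand):
    """Coarse stage: bag-of-tokens similarity that ignores graph structure."""
    q = set().union(*(tokens(s) for s, _ in query))
    c = set().union(*(tokens(s) for s, _ in cand))
    return jaccard(q, c)


def fine_score(query, cand, gamma=0.5):
    """Fine stage (simplified stand-in, an assumption of this sketch):
    match each query statement to its most similar candidate statement and
    weight the match by gamma**hops, so statements near the target count more.
    """
    total, weight = 0.0, 0.0
    for stmt, hops in query:
        w = gamma ** hops
        best = max((jaccard(tokens(stmt), tokens(c)) for c, _ in cand), default=0.0)
        total += w * best
        weight += w
    return total / weight if weight else 0.0


def retrieve(query, database, k=2, gamma=0.5):
    """Keep the top-k candidates by coarse score, then re-rank them by fine score."""
    shortlist = sorted(
        database, key=lambda item: coarse_score(query, item[1]), reverse=True
    )[:k]
    return sorted(
        shortlist, key=lambda item: fine_score(query, item[1], gamma), reverse=True
    )


# Toy data: (statement, hops from the completion target).
query = [("total = total + price", 0), ("for price in prices:", 1)]
database = [
    ("a", [("total = total + price", 0), ("while idx < n:", 1)]),
    ("b", [("count = count + 1", 0), ("for price in prices:", 1)]),
    ("c", [("print(message)", 0), ("return None", 1)]),
]
ranked_names = [name for name, _ in retrieve(query, database)]
```

In this toy data, candidate "b" shares more tokens with the query overall, so the coarse stage prefers it, while candidate "a" matches the statement closest to the target, so the decay-weighted fine stage ranks it first. Candidate "c" is dropped by the coarse filter before the more expensive fine comparison runs.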
