SaraCoder: Orchestrating Semantic and Structural Cues for Resource-Optimized Repository-Level Code Completion
Xiaohan Chen, Zhongying Pan, Quan Feng, Yu Tian, Shuqun Yang, Mengru Wang, Lina Gong, Yuxia Geng, Piji Li, Xiang Chen
TL;DR
SaraCoder tackles repository-level code completion under constrained context windows by combining semantic and structural cues. Its Hierarchical Feature Optimization (HF_OP) module refines retrieved candidates with MD5-based deduplication, a graph-based Decaying Subgraph Edit Distance for topology-aware similarity, and reranking for relevance and diversity, while a separate External-Aware Identifier Disambiguator (EAID) resolves cross-file symbol ambiguity. The refined results are assembled into well-structured prompts that improve accuracy and efficiency, as demonstrated on CrossCodeEval and RepoEval-Updated for Python and Java. The work also reports gains in resource efficiency and synergy with existing cross-file methods, offering a practical pathway to more trustworthy and scalable repository-level code completion in large real-world codebases.
Abstract
Although Retrieval-Augmented Generation (RAG) improves code completion, traditional retrieval methods suffer from information redundancy and a lack of diversity within limited context windows. To address this, we propose SaraCoder, a resource-optimized retrieval augmentation method. It maximizes information diversity and representativeness within a limited context window, significantly improving the accuracy and reliability of repository-level code completion. Its core Hierarchical Feature Optimization module systematically refines candidates in four steps: distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weighs edits by their topological importance, and reranking results to maximize both relevance and diversity. In addition, an External-Aware Identifier Disambiguator module resolves cross-file symbol ambiguity via dependency analysis. Extensive experiments on the challenging CrossCodeEval and RepoEval-Updated benchmarks demonstrate that SaraCoder outperforms existing baselines across multiple programming languages and models. Our results show that systematically refining retrieval results across multiple dimensions offers a new paradigm for building more accurate and resource-optimized repository-level code completion systems.
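To make two of the refinement steps above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that exact duplicates are detected by hashing whitespace-normalized snippets with MD5, and it substitutes a simple greedy, MMR-style selection over token-set Jaccard similarity for the paper's reranking and graph-based metric. The names `normalize`, `dedupe_md5`, `jaccard`, and `rerank`, and the weight `lam`, are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def normalize(snippet):
    """Collapse whitespace so reformatted copies hash identically (an assumption)."""
    return re.sub(r"\s+", " ", snippet).strip()


def dedupe_md5(snippets):
    """Keep the first occurrence of each snippet, keyed by the MD5 of its normalized text."""
    seen = set()
    unique = []
    for s in snippets:
        digest = hashlib.md5(normalize(s).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique


def tokens(snippet):
    """Identifier-like tokens; a crude stand-in for real semantic or structural features."""
    return set(re.findall(r"[A-Za-z_]\w*", snippet))


def jaccard(a, b):
    """Jaccard similarity of two token sets, defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rerank(query, candidates, k=2, lam=0.7):
    """Greedy MMR-style selection: reward relevance to the query, penalize
    similarity to candidates that were already chosen (illustrative only)."""
    q = tokens(query)
    cand_tokens = [tokens(c) for c in candidates]
    remaining = list(range(len(candidates)))
    chosen = []

    def score(i):
        relevance = jaccard(q, cand_tokens[i])
        redundancy = max(
            (jaccard(cand_tokens[i], cand_tokens[j]) for j in chosen),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(chosen) < k:
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [candidates[i] for i in chosen]


if __name__ == "__main__":
    retrieved = [
        "def add(x, y): return x + y",
        "def add(x, y):   return x + y",  # same code, different spacing
        "def add2(x, y): return x + y",
        "total = sum(values)",
    ]
    unique = dedupe_md5(retrieved)
    for snippet in rerank("total = add(x, y)", unique, k=2):
        print(snippet)
```

With a fixed budget of two snippets, the near-copy `add2` is passed over in favor of a less relevant but non-redundant snippet, which is the relevance-versus-diversity trade-off the abstract describes.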
