SaraCoder: Orchestrating Semantic and Structural Cues for Resource-Optimized Repository-Level Code Completion

Xiaohan Chen, Zhongying Pan, Quan Feng, Yu Tian, Shuqun Yang, Mengru Wang, Lina Gong, Yuxia Geng, Piji Li, Xiang Chen

TL;DR

SaraCoder tackles repository-level code completion under constrained context windows by integrating semantic and structural cues through Hierarchical Feature Optimization (HF_OP), MD5-based deduplication, a graph-based Decaying Subgraph Edit Distance, and an External-Aware Identifier Disambiguator (EAID). The framework refines retrieval results across multiple dimensions and generates well-structured prompts that improve accuracy and efficiency, as demonstrated on CrossCodeEval and RepoEval-Updated for Python and Java. Key contributions include the HF_OP components (topology-aware similarity and deduplication) and EAID, supported by empirical evidence of improved performance and resource efficiency and of synergy with existing cross-file methods. The work offers a practical pathway to more trustworthy, scalable repository-level code completion in large real-world codebases.

Abstract

Although Retrieval-Augmented Generation improves code completion, traditional retrieval methods struggle with information redundancy and a lack of diversity within limited context windows. To address this, we propose SaraCoder, a resource-optimized retrieval augmentation method. It maximizes information diversity and representativeness within a limited context window, significantly improving the accuracy and reliability of repository-level code completion. Its core Hierarchical Feature Optimization module systematically refines candidates by distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weights edits by their topological importance, and reranking results to maximize both relevance and diversity. Furthermore, an External-Aware Identifier Disambiguator module accurately resolves cross-file symbol ambiguity via dependency analysis. Extensive experiments on the challenging CrossCodeEval and RepoEval-Updated benchmarks demonstrate that SaraCoder outperforms existing baselines across multiple programming languages and models. Our work shows that systematically refining retrieval results across multiple dimensions provides a new paradigm for building more accurate and resource-optimized repository-level code completion systems.
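
To make two of the refinement steps named in the abstract concrete, here is a minimal Python sketch of exact-duplicate pruning via MD5 hashes followed by a greedy rerank that trades relevance against redundancy. It is an illustration under stated assumptions, not the authors' implementation. The function names (normalize, dedupe_exact, jaccard, rerank_diverse), the whitespace normalization, the token-level Jaccard similarity, and the weight lam are all hypothetical. The rerank uses a generic maximal-marginal-relevance style rule in place of the paper's semantic and graph-based scoring.

```python
import hashlib


def normalize(snippet: str) -> str:
    # Assumption: trailing whitespace and blank lines do not make two
    # snippets "different", so they are dropped before hashing.
    lines = [line.rstrip() for line in snippet.splitlines()]
    return "\n".join(line for line in lines if line)


def dedupe_exact(candidates):
    """Keep the first occurrence of each snippet, comparing MD5 digests."""
    seen = set()
    unique = []
    for snippet in candidates:
        digest = hashlib.md5(normalize(snippet).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(snippet)
    return unique


def jaccard(a: str, b: str) -> float:
    # Stand-in similarity on whitespace tokens (an assumption; the paper
    # describes semantic and structural measures instead).
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def rerank_diverse(query, candidates, k, lam=0.7):
    """Greedily pick k snippets, rewarding similarity to the query and
    penalizing similarity to snippets that were already picked."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(c):
            relevance = jaccard(query, c)
            redundancy = max((jaccard(c, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    pool = dedupe_exact(["a b c d", "a b c d\n", "a b c d e", "a b z"])
    print(rerank_diverse("a b c d", pool, k=2, lam=0.3))
```

A lower lam favors diversity over raw similarity, which is the trade-off that matters when only a few snippets fit in the context window.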

Paper Structure

This paper contains 46 sections, 2 equations, 6 figures, 9 tables, 3 algorithms.

Figures (6)

  • Figure 1: The pitfalls of pure similarity retrieval and the highlights of SaraCoder. Pink boxes illustrate traditional retrieval results based purely on surface similarity, while green boxes show results from our method, SaraCoder.
  • Figure 2: An illustration of the SaraCoder framework. (1) Database Construction. This phase constructs a key-value codebase, using a slicing algorithm to create induced graph slices that are then precisely mapped to source code snippets. (2) Code Retrieval. This phase takes code context as input and retrieves similar code, then refines suggestions via Hierarchical Feature Optimization. Concurrently, an External-Aware Identifier Disambiguator clarifies external symbols via dependency analysis, delivering highly accurate candidates. (3) Code Generation. This phase generates prompts by integrating outputs from code retrieval with the code completion context. These prompts are then fed into an LLM to predict completion statements.
  • Figure 3: Prompt template.
  • Figure 4: Impact of top_k on CrossCodeEval. (The two on the left are Java tasks, and the two on the right are Python tasks.)
  • Figure 5: Ablation study. (Each group of three data points represents the CodeGen2-7B, CodeGen2.5-7B, and CodeLlama-7B-Instruct models. Bar lengths show their average performance, with I-shaped error bars indicating standard deviation.)
  • ...and 1 more figure