ExecRepoBench: Multi-level Executable Code Completion Evaluation
Jian Yang, Jiajun Zhang, Jiaxi Yang, Ke Jin, Lei Zhang, Qiyao Peng, Ken Deng, Yibo Miao, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin
TL;DR
This work introduces ExecRepoBench, an executable repository-level benchmark for code completion that captures real-world, multi-file dependencies by leveraging 1.2K samples from 50 active Python repos. It pairs ExecRepoBench with Repo-Instruct, a multi-level grammar-based instruction corpus that masks code at expression, statement, and function levels via ASTs, to train a 7B-parameter open-source LLM, Qwen2.5-Coder-Instruct-C. Fine-tuned on nearly 3M instruction-sample completions, this model achieves state-of-the-art performance on ExecRepoBench and MultiPL-E, underscoring the value of repository-level context and executable evaluation. The results highlight the importance of execution-based metrics over purely string-based measures and demonstrate the practical viability of deploying a local, high-performance code-completion service. Overall, the framework advances code completion research by aligning benchmarks with real development workflows and enabling robust, context-aware coding assistance.
Abstract
Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark, ExecRepoBench, and an instruction corpus, Repo-Instruct, aimed at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. In addition, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g., statements, expressions, and functions). We then fine-tune an open-source LLM with 7B parameters on Repo-Instruct to produce a strong code completion baseline model, Qwen2.5-Coder-Instruct-C. Qwen2.5-Coder-Instruct-C is rigorously evaluated on existing benchmarks, including MultiPL-E and ExecRepoBench, where it consistently outperforms prior baselines across all programming languages. Qwen2.5-Coder-Instruct-C can be deployed as a high-performance, local service for programming development (https://execrepobench.github.io/).
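The multi-level masking idea can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that a code fragment is selected from the abstract syntax tree at one of three levels (expression, statement, function) and removed, leaving a prefix and suffix as context and the removed span as the completion target. The names `LEVELS` and `mask_fragment`, and the uniform random choice of node, are hypothetical and chosen for illustration only.

```python
import ast
import random

# Hypothetical level definitions: which AST nodes count as maskable at each level.
_FUNCS = (ast.FunctionDef, ast.AsyncFunctionDef)
LEVELS = {
    "expression": lambda n: isinstance(n, ast.expr),
    "statement": lambda n: isinstance(n, ast.stmt)
    and not isinstance(n, _FUNCS + (ast.ClassDef,)),
    "function": lambda n: isinstance(n, _FUNCS),
}


def mask_fragment(source, level, seed=0):
    """Split source into (prefix, middle, suffix) around one AST node.

    The node is picked at random (an assumption of this sketch) among all
    nodes matching the requested level.
    """
    data = source.encode("utf-8")  # AST column offsets are UTF-8 byte offsets
    line_starts = [0]
    for line in data.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    spans = []
    for node in ast.walk(ast.parse(source)):
        if LEVELS[level](node):
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))
    if not spans:
        raise ValueError(f"no {level}-level node found")

    start, end = random.Random(seed).choice(spans)
    return (
        data[:start].decode("utf-8"),
        data[start:end].decode("utf-8"),
        data[end:].decode("utf-8"),
    )


if __name__ == "__main__":
    sample = "def add(a, b):\n    total = a + b\n    return total\n"
    for level in LEVELS:
        prefix, middle, suffix = mask_fragment(sample, level)
        print(level, repr(middle))
```

In this sketch the prefix and suffix come from a single file; in a repository-level setting, the context would presumably also include content from other files in the repository, which is outside the scope of this example.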
