On the Impacts of Contexts on Repository-Level Code Generation
Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui
TL;DR
RepoExec introduces an executable, repository-level code generation benchmark that jointly evaluates functional correctness and dependency utilization, the latter through a novel Dependency Invocation Rate (DIR) metric. The framework automatically extracts functions and cross-file dependencies, generates high-coverage tests, and evaluates generated code in an executable environment to check alignment with developer intent. Empirical results show that pretrained LLMs achieve higher correctness (pass@k), whereas instruction-tuned models make better use of the provided dependencies (higher DIR) and debug more effectively; multi-round debugging and dependency-aware instruction tuning further boost performance. The work provides a scalable evaluation pipeline and a publicly released dataset to advance reliable CodeLLMs for real-world software development tasks.
Abstract
CodeLLMs have gained widespread adoption for code generation tasks, yet their capacity to handle repository-level code generation with complex contextual dependencies remains underexplored. Our work underscores the critical importance of leveraging repository-level contexts to generate executable and functionally correct code. We present RepoExec, a novel benchmark designed to evaluate repository-level code generation, with a focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts. Our study examines a controlled scenario where developers specify essential code dependencies (contexts), challenging models to integrate them effectively. Additionally, we introduce an instruction-tuned dataset that enhances CodeLLMs' ability to leverage dependencies, along with a new metric, Dependency Invocation Rate (DIR), to quantify context utilization. Experimental results reveal that while pretrained LLMs demonstrate superior performance in terms of correctness, instruction-tuned models excel in context utilization and debugging capabilities. RepoExec offers a comprehensive evaluation framework for assessing code functionality and alignment with developer intent, thereby advancing the development of more reliable CodeLLMs for real-world applications. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
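To make the two reported metrics concrete, the sketch below computes pass@k with the standard unbiased estimator and a simplified DIR. The abstract does not spell out the DIR formula, so the version here is an assumption: the fraction of developer-specified dependency names that appear as identifiers in the generated code. The function names `pass_at_k` and `dependency_invocation_rate` are hypothetical and are not taken from the RepoExec codebase.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples passing all tests,
    k: evaluation budget.
    """
    if not (0 < k <= n) or not (0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def dependency_invocation_rate(generated_code: str, provided_deps) -> float:
    """Assumed DIR: share of provided dependency names used in the code.

    A name counts as used if it occurs as a bare identifier
    (e.g. normalize(x)) or as an attribute (e.g. utils.clip(x)).
    """
    deps = set(provided_deps)
    if not deps:
        return 0.0
    used = set()
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Name):
            used.add(node.id)
        elif isinstance(node, ast.Attribute):
            used.add(node.attr)
    return len(used & deps) / len(deps)


# Hypothetical generated function and dependency list, for illustration only.
sample = "def f(x):\n    return normalize(x) + utils.clip(x)\n"
dir_score = dependency_invocation_rate(sample, {"normalize", "clip", "scale"})
p1 = pass_at_k(n=10, c=3, k=1)
```

In this illustration the generated function uses two of the three provided dependencies, so the assumed DIR is 2/3, while three passing samples out of ten give a pass@1 of 0.3. Name matching is a deliberately crude proxy; the benchmark itself may resolve calls differently.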
