On the Impacts of Contexts on Repository-Level Code Generation

Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

TL;DR

RepoExec introduces an executable, repository-level code generation benchmark that jointly evaluates functional correctness and dependency utilization, the latter through a novel Dependency Invocation Rate (DIR) metric. The framework automatically extracts functions and their cross-file dependencies, generates high-coverage tests, and runs generated code in an executable environment to check alignment with developer intent. Empirical results show that pretrained LLMs excel at correctness (pass@k), whereas instruction-tuned models make better use of dependencies (DIR) and show stronger debugging capabilities; multi-round debugging and dependency-aware instruction tuning further boost performance. The work provides a scalable evaluation pipeline and a publicly released dataset to advance reliable CodeLLMs for real-world software development tasks.
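The summary reports correctness as pass@k without spelling the metric out. As a rough sketch, assuming the benchmark uses the standard unbiased estimator common in code generation evaluation (this section does not confirm it), pass@k can be computed from n sampled solutions of which c pass all tests. The function name `pass_at_k` is illustrative, not taken from the RepoExec code base.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which pass all tests.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the summary above does not state
    which estimator RepoExec applies.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 correct: report the estimate for k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A per-problem estimate like this would then be averaged over all problems in the benchmark.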

Abstract

CodeLLMs have gained widespread adoption for code generation tasks, yet their capacity to handle repository-level code generation with complex contextual dependencies remains underexplored. Our work underscores the critical importance of leveraging repository-level contexts to generate executable and functionally correct code. We present RepoExec, a novel benchmark designed to evaluate repository-level code generation, with a focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts. Our study examines a controlled scenario where developers specify essential code dependencies (contexts), challenging models to integrate them effectively. Additionally, we introduce an instruction-tuned dataset that enhances CodeLLMs' ability to leverage dependencies, along with a new metric, Dependency Invocation Rate (DIR), to quantify context utilization. Experimental results reveal that while pretrained LLMs demonstrate superior performance in terms of correctness, instruction-tuned models excel in context utilization and debugging capabilities. RepoExec offers a comprehensive evaluation framework for assessing code functionality and alignment with developer intent, thereby advancing the development of more reliable CodeLLMs for real-world applications. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
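The abstract names the Dependency Invocation Rate (DIR) but does not give its formula here. The sketch below assumes DIR is the fraction of developer-provided dependencies that the generated code actually references; both that definition and the helper names `referenced_names` and `dependency_invocation_rate` are assumptions for illustration, not the authors' implementation.

```python
import ast


def referenced_names(source: str) -> set:
    """Collect identifiers and attribute names referenced in Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)        # e.g. square in square(r)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)      # e.g. sqrt in utils.sqrt(x)
    return names


def dependency_invocation_rate(generated: str, dependencies: list) -> float:
    """Fraction of the provided dependency names that appear in the generated code.

    Assumed definition: |dependencies referenced| / |dependencies provided|.
    """
    provided = set(dependencies)
    if not provided:
        return 0.0
    used = referenced_names(generated)
    return len(provided & used) / len(provided)


if __name__ == "__main__":
    # Hypothetical generated solution and the dependencies given as context.
    generated = "def area(r):\n    return PI * square(r)\n"
    print(dependency_invocation_rate(generated, ["PI", "square", "cube"]))
```

Name matching of this kind is deliberately simple: it does not resolve imports or distinguish a dependency from a local variable with the same name, so a real evaluation would likely need stricter resolution.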

Paper Structure

This paper contains 39 sections, 1 equation, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Data Processing Pipeline of RepoExec
  • Figure 2: Illustration of a data instance in RepoExec. The target function signatures and their associated docstrings, which describe the functionality of the functions, are shown in (6). The infile-imports and variable declarations are represented by (5) and (4), respectively. The remaining components, (1), (2), and (3), represent the function and class contexts. Specifically, (1) denotes the class or function signature, (2) may contain the description of the class, and (3) represents the function body of the cross-file function.
  • Figure 3: Correlation between Match-based metrics and Execution-based metric (Pass@1).
  • Figure 4: Match-based metric distributions between Correct and Incorrect solutions
  • Figure 5: Performance of various CodeLMs on RepoExec before (bf-) and after (af-) Coverage Enhancement (CovEn) stage.
  • ...and 4 more figures