HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation

Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng

TL;DR

This work identifies a critical flaw in existing repository-level code generation benchmarks: they ignore software evolution, which inflates assessments of LLM performance. It introduces HumanEvo, an evolution-aware benchmark that rolls each repository back to its state before the target commit and evaluates generated code with execution-based tests, across Python and Java with multiple docstring styles and dependency levels. Across seven LLMs, the study shows that evolution-ignored settings overestimate performance by 10.0% to 61.1%, depending on the context acquisition method, and that performance degrades as dependencies become more complex and the project evolves. The authors provide a ready-to-use benchmarking toolkit and concrete recommendations for more realistic evaluation of LLM-driven repository-level code generation.

Abstract

To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual code from the latest version of a project to assist LLMs in accurately generating the desired function. However, such evaluation methods fail to consider the dynamic evolution of software projects over time; we refer to these as evolution-ignored settings. This in turn results in an inaccurate evaluation of LLMs' performance. In this paper, we conduct an empirical study to understand in depth LLMs' code generation performance within settings that reflect the evolving nature of software development. To achieve this, we first construct an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. Second, we manually categorize HumanEvo according to dependency levels to more comprehensively analyze model performance in generating functions with different dependency levels. Third, we conduct extensive experiments on HumanEvo with seven representative and diverse LLMs to verify the effectiveness of the proposed benchmark. We obtain several important findings through our experimental study. For example, we find that previous evolution-ignored evaluation methods result in inflated performance of LLMs, with performance overestimations ranging from 10.0% to 61.1% under different context acquisition methods, compared to the evolution-aware evaluation approach. Based on the findings, we give actionable suggestions for more realistic evaluation of LLMs on code generation. We also build a shared evolution-aware code generation toolbox to facilitate future research.
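The difference between the two settings can be illustrated with a small sketch. It is not the HumanEvo toolkit: the `Commit` record, both function names, and the toy three-commit history are hypothetical, and each commit is assumed to store a full snapshot of the repository. The evolution-aware function returns the snapshot just before the target commit, while the evolution-ignored one returns the latest snapshot.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified commit record (not the HumanEvo schema)."""
    sha: str
    files: Dict[str, str]  # full snapshot: path -> file content after this commit


def evolution_aware_context(history: List[Commit], target_sha: str) -> Dict[str, str]:
    """Return the repository snapshot as it was just before the target commit."""
    for index, commit in enumerate(history):
        if commit.sha == target_sha:
            if index == 0:
                return {}  # the target is the first commit: no earlier context exists
            return dict(history[index - 1].files)
    raise ValueError(f"unknown commit: {target_sha}")


def evolution_ignored_context(history: List[Commit]) -> Dict[str, str]:
    """Return the latest snapshot, which is what evolution-ignored settings use."""
    return dict(history[-1].files)


# Toy history (assumed data): commit c2 is the target and introduces fetch().
history = [
    Commit("c1", {"util.py": "def parse(x): ..."}),
    Commit("c2", {"util.py": "def parse(x): ...",
                  "api.py": "def fetch(): ..."}),
    Commit("c3", {"util.py": "def parse_v2(x): ...",
                  "api.py": "def fetch(): ...",
                  "cli.py": "from api import fetch"}),
]

aware = evolution_aware_context(history, "c2")
ignored = evolution_ignored_context(history)
```

In this toy history, the latest snapshot already contains `api.py` and a caller in `cli.py`, so it exposes code written after the target commit, and it no longer contains the original `parse` helper that was available when the target function was written. The rolled-back snapshot avoids both problems.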

Paper Structure (25 sections, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Example 1: future context leakage.
  • Figure 2: Example 2: useful context missing.
  • Figure 3: Task instance overview.
  • Figure 4: HumanEvo construction pipeline.
  • Figure 5: LLMs' performance when using different project versions as context sources.