Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?

Zhenyu Pan; Rongyu Cao; Yongchang Cao; Yingwei Ma; Binhua Li; Fei Huang; Han Liu; Yongbin Li

TL;DR

The Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework, assesses whether a code completion tool can capture a developer's immediate intent and suggest appropriate code across diverse contexts, providing a more realistic benchmark for code completion in modern software development.

Abstract

Code completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation benchmark that enables meaningful comparisons between products and guides future advancements. However, existing benchmarks focus on coarse-grained tasks that are not grounded in industrial analysis and that resemble general code generation more than the real-world scenarios developers encounter. Moreover, these benchmarks often rely on costly and time-consuming human annotation, and their standalone test cases fail to use a minimal set of tests to achieve maximum repository-level understanding and code coverage. To address these limitations, we first analyze business data from an industrial code completion tool and redefine the evaluation criteria to better align with the developer's intent and desired completion behavior throughout the coding process. Based on these insights, we introduce Codev-Agent, an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage, ensuring fair and effective comparisons. Using Codev-Agent, we present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Bench assesses whether a code completion tool can capture a developer's immediate intent and suggest appropriate code across diverse contexts, providing a more realistic benchmark for code completion in modern software development.
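
The idea of extracting a dynamic calling chain from an existing unit test and combining it with static parsing can be illustrated with a minimal sketch. This is not Codev-Agent's implementation: the sample source, the function names (`static_calls`, `dynamic_calls`, `fuse`), and the use of `sys.setprofile` are assumptions chosen only to show the general technique on a toy module.

```python
import ast
import sys
import textwrap

# Hypothetical toy "repository": three functions and one unit test.
SOURCE = textwrap.dedent('''
    def normalize(x):
        return x.strip().lower()

    def greet(name):
        return "hello " + normalize(name)

    def unused():
        return normalize("never called")

    def test_greet():
        assert greet("  Ada ") == "hello ada"
''')

FILENAME = "<sample>"  # marker used to recognise frames from the toy module


def static_calls(source):
    """Collect (caller, callee) edges for plain-name calls found in the AST."""
    edges = set()
    for func in ast.parse(source).body:
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    edges.add((func.name, node.func.id))
    return edges


def dynamic_calls(source, test_name):
    """Run one test under a profiler hook and record the edges it executes."""
    namespace = {}
    exec(compile(source, FILENAME, "exec"), namespace)
    edges = set()

    def profiler(frame, event, arg):
        # Only Python-level calls between functions of the toy module count.
        if event == "call" and frame.f_code.co_filename == FILENAME:
            caller = frame.f_back
            if caller is not None and caller.f_code.co_filename == FILENAME:
                edges.add((caller.f_code.co_name, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        namespace[test_name]()
    finally:
        sys.setprofile(None)
    return edges


def fuse(static_edges, dynamic_edges):
    """Map every known edge to True if the test actually exercised it."""
    return {e: e in dynamic_edges for e in static_edges | dynamic_edges}


if __name__ == "__main__":
    fused = fuse(static_calls(SOURCE), dynamic_calls(SOURCE, "test_greet"))
    for edge, covered in sorted(fused.items()):
        print(edge, "covered" if covered else "static only")
```

In this sketch, edges that appear only in the static view (such as the call inside `unused`) are not exercised by the test, so a benchmark built this way could restrict completion targets to code that an existing test can actually verify.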

Paper Structure (45 sections, 1 equation, 9 figures, 7 tables)

Introduction
Background and Related Work
LLM for Code Generation
Benchmark for Code Generation
Benchmark for Code Completion
Methodology
Product Business Data Analysis
Codev-Agent
Automated Repository Crawling
Execution Environment Setup
Dynamic and Static Call Chain Analysis
Test Sample Generation
Evaluation Execution
Codev-Bench
Features of Codev-Bench
...and 30 more sections

Figures (9)

Figure 1: Business data analysis. (A) Code block category distribution. (B) Completion line distribution. (C) Prompt length distribution.
Figure 2: Overview of Codev-Agent. (a) An LLM-based crawler selects up-to-date, lightweight, highly starred, non-forked repositories with unit test files. (b) Codev-Agent uses an LLM (Qwen) to read README files, generating installation commands and iteratively refining them based on logs and error reports from running unit tests, until the tests execute successfully. (c) Codev-Agent combines dynamic data flow analysis during unit test execution with static code parsing (AST), creating a fused code chain that reflects both dynamic and static perspectives. (d) Test Sample Generation extracts test samples from the fusion results based on real-world business scenarios, delivering our Codev-Bench.
Figure 3: Final test sample in JSON format.
Figure 4: A sample prompt given to the LLM under evaluation for completion. The complete prompt is shown in the appendix.
Figure 5: Comparing average lines of generated code by different models in four scenarios.
...and 4 more figures
