ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

Jia Feng; Jiachen Liu; Cuiyun Gao; Chun Yong Chong; Chaozheng Wang; Shan Gao; Xin Xia

TL;DR

This work proposes ComplexCodeEval, a new benchmark for evaluating the performance of large code models (LCMs) in various development scenarios, and conducts an in-depth analysis of the impact of context and data leakage on model performance.

Abstract

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which do not reflect the diverse challenges developers face in real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark designed to assess LCMs in various development tasks, including code generation, completion, API recommendation, and test case generation. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories, each annotated with function signatures, docstrings, and API references to simulate real development environments. Our experiments across ten LCMs reveal that context improves performance and that data leakage can lead to overestimation, highlighting the need for more accurate evaluations.

Paper Structure (30 sections, 5 figures, 8 tables)

Introduction
Background And Related Work
Large Code Models
Benchmarks for Code-Related Tasks
ComplexCodeEval
Dataset Collection
Library Selection.
Repository Selection.
Candidate Methods Extraction.
Dataset Generation
Test Cases Extraction.
Annotations Extraction.
Docstring Generation.
Time Tagging.
Benchmark Characteristic
...and 15 more sections

Figures (5)

Figure 1: The process of ComplexCodeEval construction.
Figure 2: An example of a Python test function. The choices function is invoked by filter_spec (line 9), which is an instance of NullFieldListFilter. Because NullFieldListFilter is imported as django_extensions.admin.filter.NullFieldListFilter (line 2), the original path of choices is set to that imported class.
Figure 3: An example of ComplexCodeEval.
Figure 4: The prompt template used for generating docstrings.
Figure 5: An example of the original docstring and the LLM-generated docstring.
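
The path resolution described in the Figure 2 caption can be sketched with Python's standard ast module. This is a minimal illustration, not the authors' implementation: the function name resolve_call_paths, the example source string, and the restriction to "from ... import" statements and direct constructor assignments are all assumptions made here, since the actual code of Figure 2 is not reproduced in this text.

```python
import ast

# Hypothetical test source, loosely modeled on the Figure 2 caption.
# It is only parsed, never executed, so undefined names are harmless.
EXAMPLE = '''
from django_extensions.admin.filter import NullFieldListFilter

def test_choices():
    filter_spec = NullFieldListFilter(*args)
    result = filter_spec.choices(changelist)
'''


def resolve_call_paths(source: str) -> dict:
    """Map 'variable.method' calls to the import path of the variable's class."""
    nodes = list(ast.walk(ast.parse(source)))

    # Pass 1: local name -> fully qualified path, from "from X import Y".
    imports = {}
    for node in nodes:
        if isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                local_name = alias.asname or alias.name
                imports[local_name] = f"{node.module}.{alias.name}"

    # Pass 2: variable -> class path, from "var = ImportedClass(...)".
    instances = {}
    for node in nodes:
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id in imports):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    instances[target.id] = imports[node.value.func.id]

    # Pass 3: "var.method(...)" -> path of the class that var was built from.
    resolved = {}
    for node in nodes:
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in instances):
            variable = node.func.value.id
            resolved[f"{variable}.{node.func.attr}"] = instances[variable]

    return resolved
```

Calling resolve_call_paths(EXAMPLE) maps filter_spec.choices to django_extensions.admin.filter.NullFieldListFilter, matching the caption's statement that the original path of choices is the imported class. The three separate passes are used because ast.walk does not guarantee source order, so imports and assignments are collected before calls are resolved.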
