CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Yiqing Xie; Alex Xie; Divyanshu Sheth; Pengfei Liu; Daniel Fried; Carolyn Rose

TL;DR

The paper addresses the challenge of evaluating code generation beyond static analysis by introducing CodeBenchGen, a framework that uses LLM-driven sandboxing, test generation, and iterative debugging to convert arbitrary code fragments into executable evaluation examples. It builds Exec-CSN from CodeSearchNet, comprising 1,931 executable examples across 367 GitHub repositories, enabling broad domain coverage and realistic evaluation. Through human studies and benchmarking of 12 open-source and proprietary models, the work shows substantial headroom for improvement and highlights the impact of example complexity, library usage, and revision strategies on performance. The framework and dataset offer a scalable, reproducible path toward more robust, execution-based assessment of real-world code generation across diverse programming domains.

Abstract

To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.
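
The pipeline summarized above (sandbox a code fragment, generate tests, then debug until the tests pass) can be illustrated with a minimal sketch. This is not the authors' implementation: the `LLM` callable, the `SANDBOX`/`TESTS`/`DEBUG` prompt prefixes, the `Example` container, and the `max_rounds` limit are all hypothetical names chosen for illustration, and the final post-processing step is omitted.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Tuple

# Hypothetical LLM interface (an assumption): takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class Example:
    code: str   # sandboxed context plus reference solution
    tests: str  # test script that raises (non-zero exit) on failure


def run_tests(code: str, tests: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Run the code followed by its tests in a fresh interpreter process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "example.py"
        script.write_text(code + "\n\n" + tests)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stderr


def build_example(fragment: str, llm: LLM, max_rounds: int = 3) -> Optional[Example]:
    """Sandbox a fragment, generate tests, and debug until the tests pass."""
    code = llm("SANDBOX\n" + fragment)   # step 1: adapt code to run in isolation
    tests = llm("TESTS\n" + code)        # step 2: generate tests for adapted code
    for attempt in range(max_rounds + 1):
        passed, log = run_tests(code, tests)
        if passed:
            return Example(code=code, tests=tests)
        if attempt < max_rounds:         # step 3: revise using the error log
            code = llm("DEBUG\n" + code + "\n" + tests + "\n" + log)
    return None                          # drop fragments that never pass
```

In this sketch a fragment is discarded if it still fails after `max_rounds` revisions, which keeps only examples whose reference solution passes its own tests. Running untrusted code in a bare subprocess is not real isolation; a container or similar sandbox would be needed in practice.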

Paper Structure (27 sections, 10 figures, 9 tables)

This paper contains 27 sections, 10 figures, 9 tables.

Introduction
Methodology of CodeBenchGen
Creating a Dataset: Exec-CSN using CodeBenchGen
Quality Verification for Exec-CSN
Diversity Analysis
Realism Analysis
Complexity Analysis
Complexity and Solvability Analysis by Human Study
Code Generation Performance Evaluation
Experimental Setup
Main Results
Results Analysis
Related Work
Code Generation Benchmarks
Automatic Test Generation
...and 12 more sections

Figures (10)

Figure 1: Comparison with existing dataset creation methods. We follow the original papers of RepoEval and R2E and apply their repository filtering strategies to the same repositories we build our dataset from (i.e., 408 repositories in the CodeSearchNet dataset). The final number of repositories in RepoEval and R2E also depends on how much human effort is spent on environment setup, debugging, etc.
Figure 2: The input of CodeBenchGen is an arbitrary code fragment (e.g., code on GitHub), and the output is an evaluation example as illustrated. The context and ground truth are adapted from the input code fragment by an LLM. The instruction and tests are generated based on the adapted code.
Figure 3: Illustration of the sandboxing step. When adapting the input code, we observe that the LLM can successfully adapt (1) local module imports, (2) external API usage, and (3) local file reading.
Figure 4: The framework of CodeBenchGen, which leverages an LLM to convert a code fragment selected by the user to an evaluation example. The framework (1) sandboxes the code fragment to run in an isolated environment, (2) generates tests for the code, (3) iteratively debugs or regenerates the code to ensure its functional correctness, and (4) post-processes the code into an evaluation example.
Figure 5: Domain diversity in different datasets. We check each dataset's coverage of the top-30 most common libraries or topics, which are estimated by frequency in the Stack dataset. "CSN" denotes CodeSearchNet, which we use as input to our framework to create Exec-CSN.
...and 5 more figures
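
The library coverage mentioned in the Figure 5 caption can be made concrete with a small sketch. The caption does not say how coverage is computed, so this is only one plausible reading: a library counts as covered if at least one example imports it. The function names `imported_libraries` and `library_coverage` are hypothetical, and the top-library list would have to come from an external frequency estimate.

```python
import ast
from typing import Iterable, Set


def imported_libraries(source: str) -> Set[str]:
    """Return the top-level module names imported by a Python source string."""
    libs: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            libs.add(node.module.split(".")[0])
    return libs


def library_coverage(examples: Iterable[str], top_libraries: Iterable[str]) -> float:
    """Fraction of top_libraries imported by at least one example (assumed metric)."""
    top = set(top_libraries)
    if not top:
        return 0.0
    used: Set[str] = set()
    for source in examples:
        used |= imported_libraries(source)
    return len(used & top) / len(top)
```

For instance, if the examples collectively import three of four listed libraries, `library_coverage` returns 0.75.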
