CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose
TL;DR
The paper addresses the challenge of evaluating code generation beyond static analysis by introducing CodeBenchGen, a framework that uses LLM-driven sandboxing, test generation, and iterative debugging to convert arbitrary code fragments into executable evaluation examples. It builds Exec-CSN from CodeSearchNet, comprising 1,931 executable examples across 367 GitHub repositories, enabling broad domain coverage and realistic evaluation. Through human studies and model benchmarking on 12 open-source and proprietary models, the work demonstrates substantial headroom for improvements and highlights the impact of example complexity, library usage, and revision strategies on performance. The framework and dataset offer a scalable, reproducible path toward more robust, execution-based assessment of real-world code generation capabilities across diverse programming domains.
Abstract
To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.
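To make the idea of execution-based example construction concrete, the following is a minimal Python sketch of a sandbox, test, and debug loop. It is an illustration under stated assumptions, not the CodeBenchGen implementation: the names `passes_tests` and `build_example`, the three callables `sandbox`, `write_tests`, and `debug` (which stand in for LLM calls), and the `max_rounds` parameter are all hypothetical. A real pipeline would also need stronger isolation than a plain subprocess.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run the solution followed by its tests in a fresh Python process.

    Returns True only if the process exits with status 0 within the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def build_example(fragment, sandbox, write_tests, debug, max_rounds=3):
    """Hypothetical loop: sandbox a fragment, write tests, then debug until they pass.

    `sandbox`, `write_tests`, and `debug` are placeholders for LLM calls.
    Returns (code, tests) on success, or None if the example never executes cleanly.
    """
    code = sandbox(fragment)
    tests = write_tests(code)
    for round_idx in range(max_rounds + 1):
        if passes_tests(code, tests):
            return code, tests
        if round_idx < max_rounds:
            # Ask the (stand-in) model to revise the code and/or the tests.
            code, tests = debug(code, tests)
    return None
```

The same `passes_tests` check could be reused at evaluation time: a model's generated function body is substituted into the sandboxed code and counted as correct only if the stored tests pass.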
