Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

Zhe Zhang, Runlin Liu, Aishan Liu, Xingyu Liu, Xiang Gao, Hailong Sun

TL;DR

Code2Bench introduces Dual Scaling to address two deficiencies of code-generation benchmarks: static problem sources and superficial testing. The CODE2BENCH framework is an automated, end-to-end pipeline that dynamically ingests real-world code and uses Property-Based Testing with a 100% branch-coverage gate, yielding CODE2BENCH-2509 across Python and Java. The study reveals a persistent gap between algorithmic synthesis (Self-Contained, SC) and API usage (Weakly Self-Contained, WSC) tasks, shows that language ecosystems shape failure modes, and demonstrates that rigorous testing uncovers an illusion of correctness on simpler benchmarks. By providing a scalable, diagnostic framework with reproducible results, Code2Bench aims to redefine robust evaluation for LLMs in software engineering.

Abstract

The evaluation of code-generating Large Language Models (LLMs) is fundamentally constrained by two intertwined challenges: a reliance on static, easily contaminated problem sources and the use of superficial, low-rigor testing. This paper introduces a new benchmark construction philosophy, Dual Scaling, designed to systematically address both limitations. Our approach involves continuously scaling the source of problems from dynamic, real-world code repositories and systematically scaling the rigor of tests via automated, high-coverage Property-Based Testing (PBT). We instantiate this philosophy in CODE2BENCH, an end-to-end framework that leverages Scope Graph analysis for principled dependency classification and a 100% branch coverage quality gate to ensure test suite integrity. Using this framework, we construct CODE2BENCH-2509, a new benchmark suite with native instances in both Python and Java. Our extensive evaluation of 10 state-of-the-art LLMs on CODE2BENCH-2509, powered by a novel "diagnostic fingerprint" visualization, yields three key insights: (1) models exhibit a fundamental performance gap, excelling at API application (Weakly Self-Contained tasks) but struggling with algorithmic synthesis (Self-Contained tasks); (2) a model's performance is profoundly shaped by the target language's ecosystem, a nuance we are the first to systematically quantify; and (3) our rigorous, scaled testing is critical in uncovering an "illusion of correctness" prevalent in simpler benchmarks. Our work presents a robust, scalable, and diagnostic paradigm for the next generation of LLM evaluation in software engineering. The code, data, and results are available at https://code2bench.github.io/.
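
To make the testing idea in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical ground-truth function (`reference_is_leap`), a hypothetical flawed model output (`candidate_is_leap`), and a hand-rolled random input generator from the standard library in place of a real Property-Based Testing library. The coverage gate is approximated by manually recording which branches of the reference function the generated inputs reach; all names and the input range are illustrative assumptions.

```python
import random

# Branch labels for the hand-instrumented reference function (an assumption
# of this sketch; a real pipeline would use a coverage tool instead).
ALL_BRANCHES = frozenset({"div400", "div100", "div4", "other"})


def reference_is_leap(year, hit=None):
    """Hypothetical ground-truth function; records the branch each call takes."""
    hit = set() if hit is None else hit
    if year % 400 == 0:
        hit.add("div400")
        return True
    if year % 100 == 0:
        hit.add("div100")
        return False
    if year % 4 == 0:
        hit.add("div4")
        return True
    hit.add("other")
    return False


def candidate_is_leap(year):
    """Hypothetical model output that forgets the divisible-by-400 rule."""
    return year % 4 == 0 and year % 100 != 0


def build_suite(seed=0, max_attempts=100_000):
    """Generate random inputs until every reference branch has been exercised."""
    rng = random.Random(seed)
    hit, suite = set(), []
    for _ in range(max_attempts):
        year = rng.randint(1, 4000)
        # The reference output serves as the expected value for this input.
        suite.append((year, reference_is_leap(year, hit)))
        if hit == ALL_BRANCHES:
            return suite
    # Quality gate: refuse to emit a suite that misses a branch.
    raise RuntimeError("suite rejected: branch coverage below 100%")


def pass_rate(func, suite):
    """Fraction of test cases on which func agrees with the expected output."""
    return sum(func(year) == expected for year, expected in suite) / len(suite)


if __name__ == "__main__":
    suite = build_suite()
    print("suite size:", len(suite))
    print("candidate pass rate:", pass_rate(candidate_is_leap, suite))
```

A shallow hand-written suite such as 2023, 2024, and 1900 accepts the flawed candidate, whereas any suite that passes the coverage gate must contain a year divisible by 400 and therefore rejects it. This is one way to picture the "illusion of correctness" the abstract refers to.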

Paper Structure

This paper contains 74 sections, 1 equation, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Overview of the Code2Bench Framework.
  • Figure 2: The CODE2BENCH multi-dimensional evaluation landscape.
  • Figure 3: Fingerprints across the three evaluation tracks—SC-Python (left), WSC-Python (middle), and SC-Java (right)—shown as ridgeline plots. Each curve captures a model’s outcome distribution, ranging from SyntaxErr to Perfect, with key pass rates annotated.
  • Figure 4: Prevalence of "Near-Perfect" Failures (Pass@ $\geq$98%) in CODE2BENCH.
  • Figure 5: Performance on EvalPlus and CODE2BENCH-2509.