CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks

Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

TL;DR

CoreCodeBench is a fine-grained, repository-level benchmark for code intelligence. It decouples coding into six atomic task types (spanning Development, BugFix, and TDD) plus composite tasks, all derived from identical code contexts. The CorePipe framework automates repository context extraction, atomic task generation, and composite task scaling with controllable difficulty, yielding high data quality and strong robustness to prompt variation. Experiments with state-of-the-art LLMs reveal significant capability misalignment across cognitive dimensions and a marked difficulty with interdependent multi-function tasks, underscoring that coding proficiency is non-monolithic. The benchmark demonstrates scalability and diagnostic value. The authors acknowledge limitations, including its restricted language scope and its dependence on unit tests, and outline a path toward broader language support and automated test generation to sustain progress in code intelligence research.

Abstract

The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation and ensures superior data quality. It achieves a 78.55% validity yield, significantly surpassing the 31.7% retention rate of SWE-bench-Verified. Extensive experiments with state-of-the-art LLMs reveal a significant capability misalignment, evidenced by distinct ranking shifts across cognitive dimensions. This indicates that coding proficiency is non-monolithic, as strength in one aspect does not necessarily translate to others. These findings underscore the necessity of our fine-grained taxonomy in diagnosing model deficiencies and offer a sustainable, rigorous framework for evolving code intelligence. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.

Paper Structure

This paper contains 73 sections, 7 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Decomposing Code Intelligence. (a) CoreCodeBench isolates distinct cognitive demands (Dev, BugFix, TDD) within an identical code context. (b) Performance comparison across these dimensions reveals significant capability misalignment, highlighting that coding proficiency is non-monolithic.
  • Figure 2: Overview of the CorePipe Framework. (i) Context Extraction builds verifiable Function Call Trees from unit tests. (ii) Atomic Task Generation isolates cognitive demands (Dev., BugFix, TDD) within identical contexts. (iii) Composite Task Scaling aggregates atomic tasks into subgraphs, modulating difficulty via Dependency Depth ($d$) and Task Quantity ($\nu$) to prevent saturation.
  • Figure 3: Pearson correlation matrix across six tasks.
  • Figure 4: Quantifying Capability Misalignment via Inter-task IoU. The generally low IoU values reveal significant inconsistency between tasks.
  • Figure 5: Difficulty Scaling Analysis. (a) Performance consistently declines as the code length increases. (b) Increasing the number of interdependent functions ($\nu$) triggers a performance collapse.
  • ...and 6 more figures
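
The inter-task IoU named in the Figure 4 caption can be illustrated with a minimal sketch. The caption does not define the metric, so this sketch assumes that, for a single model, each task type is represented by the set of code contexts the model solves, and that IoU is the usual intersection-over-union of two such sets. The function names, context identifiers, and data below are hypothetical and are not taken from the paper or the CorePipe code.

```python
from itertools import combinations


def iou(a: set, b: set) -> float:
    """Intersection over union of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inter_task_iou(solved: dict) -> dict:
    """Pairwise IoU between the solved-context sets of each task type.

    `solved` maps a task name to the set of context IDs the model solved
    under that task (an assumed representation, not the paper's format).
    """
    return {
        (s, t): iou(solved[s], solved[t])
        for s, t in combinations(sorted(solved), 2)
    }


# Hypothetical results for one model on five shared code contexts.
solved = {
    "Dev": {"ctx1", "ctx2", "ctx3"},
    "BugFix": {"ctx2", "ctx3", "ctx4"},
    "TDD": {"ctx5"},
}

scores = inter_task_iou(solved)
for pair, value in scores.items():
    print(pair, round(value, 2))
```

Under this assumed definition, a low IoU between two task types means the model tends to solve different contexts under each task, which is the kind of inconsistency the caption describes.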