CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks

Lingyue Fu; Hao Guan; Bolun Zhang; Haowei Yuan; Yaoming Zhu; Jun Xu; Zongyu Wang; Lin Qiu; Xunliang Cai; Xuezhi Cao; Weiwen Liu; Weinan Zhang; Yong Yu

TL;DR

CoreCodeBench is a fine-grained, repository-level benchmark for code intelligence that decouples coding into atomic task types (Development, BugFix, TDD) and composite tasks, all derived from identical code contexts. The CorePipe framework automates repository context extraction, atomic task generation, and composite task scaling with controllable difficulty; the authors report high data quality and strong robustness to prompt variation. Experiments across state-of-the-art LLMs reveal significant capability misalignment across cognitive dimensions and particular difficulty with interdependent multi-function tasks, underscoring that coding proficiency is not monolithic. The benchmark shows scalability and diagnostic value. The authors acknowledge limitations, including limited language scope and dependence on unit tests, and outline a path toward broader language support and automated test generation.

Abstract

The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation and ensures superior data quality. It achieves a 78.55% validity yield, significantly surpassing the 31.7% retention rate of SWE-bench-Verified. Extensive experiments with state-of-the-art LLMs reveal a significant capability misalignment, evidenced by distinct ranking shifts across cognitive dimensions. This indicates that coding proficiency is non-monolithic, as strength in one aspect does not necessarily translate to others. These findings underscore the necessity of our fine-grained taxonomy in diagnosing model deficiencies and offer a sustainable, rigorous framework for evolving code intelligence. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.
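To make the notion of "ranking shifts across cognitive dimensions" concrete, the following minimal Python sketch ranks models separately on each task type and reports each model's position per type. It is an illustration only, not the authors' evaluation code: the function names `rank_models` and `ranking_shifts`, the model names, and all scores are hypothetical and invented for this example.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names sorted from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda model: (-scores[model], model))


def ranking_shifts(per_task: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, int]]:
    """Map each model to its 1-based rank on every task type."""
    ranks: Dict[str, Dict[str, int]] = {}
    for task, scores in per_task.items():
        for position, model in enumerate(rank_models(scores), start=1):
            ranks.setdefault(model, {})[task] = position
    return ranks


# Hypothetical pass rates, invented purely for illustration (not results from the paper).
hypothetical_scores = {
    "Development": {"model_a": 0.60, "model_b": 0.50, "model_c": 0.40},
    "BugFix": {"model_a": 0.30, "model_b": 0.55, "model_c": 0.45},
    "TDD": {"model_a": 0.50, "model_b": 0.35, "model_c": 0.65},
}

ranks = ranking_shifts(hypothetical_scores)
for model, by_task in sorted(ranks.items()):
    print(model, by_task)
```

In this toy data, each model leads on a different task type, so a single aggregate pass rate would hide exactly the kind of per-dimension differences the abstract describes.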
