CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis

Yunkun Wang, Xuanhe Zhang, Junxiao Han, Chen Zhi, Shuiguang Deng

TL;DR

CodeGlance introduces a multi-dimensional benchmark to evaluate LLMs on dynamic code behavior across intrinsic logic, API interactions, and unseen functions. The method combines a formal task definition with a three-scenario data framework, a structured problem-construction pipeline, and a nine-feature analysis to dissect reasoning difficulty. Key findings show unseen-function reasoning remains the hardest challenge for smaller models and benefits substantially from scaling, while dynamic execution features consistently drive difficulty more than static structure; augmentation strategies offer scenario-dependent gains. The work provides practical guidance for designing code-assisted AI tools, suggesting when to apply chain-of-thought prompting, retrieval-augmented methods, or code/documentation search to address specific knowledge and execution challenges in real-world software development.

Abstract

In modern software development, developers frequently need to understand code behavior at a glance -- whether reviewing pull requests, debugging issues, or navigating unfamiliar codebases. This ability to reason about dynamic program behavior is fundamental to effective software engineering and increasingly supported by Large Language Models (LLMs). However, existing studies on code reasoning focus primarily on isolated code snippets, overlooking the complexity of real-world scenarios involving external API interactions and unfamiliar functions. This gap hinders our understanding of what truly makes code reasoning challenging for LLMs across diverse programming contexts. We present CodeGlance, a multi-dimensional benchmark investigating code reasoning challenges across three realistic scenarios: intrinsic logic reasoning, API interaction reasoning, and unseen function reasoning. Through systematic evaluation of 7 state-of-the-art LLMs, we reveal that unseen function reasoning poses significant challenges especially for smaller models, with Qwen2.5-3b achieving only 6.0% accuracy on unseen functions compared to 37.5% on familiar APIs. We identify critical code complexity features -- including execution trace length, API invocation count, and control flow complexity -- that significantly impact code reasoning difficulty across scenarios. We further investigate how common augmentation strategies, including CoT, document retrieval, and code search, can improve reasoning performance, finding that their effectiveness varies substantially depending on whether challenges stem from logical complexity or knowledge gaps. These findings provide actionable guidance for developing more capable code reasoning systems and deploying LLM-based programming assistants in real-world software development.
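
The complexity features named in the abstract (code size, control flow complexity, API invocation count, execution trace length) can be illustrated with a small Python sketch. This is not the authors' measurement pipeline: the function names, the sample program, and the simplified cyclomatic-complexity rule (one plus the number of branching constructs) are all assumptions made for illustration, and counting every call site is only a rough stand-in for counting API invocations.

```python
import ast
import sys

# Hypothetical sample program, used only to exercise the helpers below.
SAMPLE = '''
def f(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x
    return total
'''


def count_code_lines(source: str) -> int:
    """Static feature: number of non-blank source lines."""
    return sum(1 for line in source.splitlines() if line.strip())


def approx_cyclomatic_complexity(source: str) -> int:
    """Static feature: 1 + number of decision points (a simplified rule)."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


def count_call_sites(source: str) -> int:
    """Static feature: number of call expressions (rough proxy for API use)."""
    return sum(isinstance(node, ast.Call) for node in ast.walk(ast.parse(source)))


def trace_length(func, *args):
    """Dynamic feature: number of line events executed inside `func` itself."""
    count = 0
    target = func.__code__

    def tracer(frame, event, arg):
        nonlocal count
        if frame.f_code is not target:
            return None  # do not trace lines in other frames
        if event == "line":
            count += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore any tracer that was already installed
    return count, result


if __name__ == "__main__":
    namespace = {}
    exec(SAMPLE, namespace)
    f = namespace["f"]
    print("code lines:", count_code_lines(SAMPLE))
    print("approx. cyclomatic complexity:", approx_cyclomatic_complexity(SAMPLE))
    print("call sites:", count_call_sites(SAMPLE))
    print("trace length, short input:", trace_length(f, [2])[0])
    print("trace length, long input:", trace_length(f, [1, 2, 3, 4])[0])
```

The first three helpers depend only on the source text, while the trace length grows with the input. This is the distinction between static structure and dynamic execution behavior that the summary above draws.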

Paper Structure

This paper contains 43 sections, 1 equation, 28 figures, 6 tables.

Figures (28)

  • Figure 1: Pipeline of CodeGlance Benchmark Construction and Framework for Empirical Study.
  • Figure 2: Code lines analysis in CRUXEval.
  • Figure 3: Cyclomatic complexity analysis in CRUXEval.
  • Figure 4: Execution trace length analysis in CRUXEval.
  • Figure 5: API count analysis in DS-1000.
  • ...and 23 more figures