Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Changshu Liu, Alireza Ghazanfari, Yang Chen, Reyhaneh Jabbarvand

TL;DR

RE2-Bench addresses the gap between synthetic benchmarks and real-world code reasoning by collecting 1,101 problems from real-world projects and mature datasets and by automatically serializing complex variables so that input and output reasoning can be checked rigorously. It uses static and dynamic analysis to derive nine code complexity metrics, then applies adaptive majority voting to sort problems into two well-separated difficulty levels, with versioning so the benchmark can scale as complexity grows. A large-scale evaluation of six LLMs on input prediction (IP) and output prediction (OP) tasks shows a substantial performance drop from Easy to Hard problems, challenging prior conclusions about LLMs' code reasoning capabilities and highlighting the need for more realistic benchmarks. The work also introduces a fine-grained failure taxonomy and ablation studies on prompting strategies and categorization parameters to guide future improvements in code reasoning models and evaluation methodologies.

Abstract

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Existing benchmarks involve simple programs, failing to represent real-world complexities such as inter- or intra-procedural dependencies, core or third-party API calls, highly nested constructs, and non-primitive complex types. Evaluating LLMs under such a simplistic setting poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, this paper proposes RE2-Bench, a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. RE2-Bench leverages static and dynamic program analysis to automatically serialize and deserialize compound, complex, and custom types in real-world code, going far beyond the primitive-only settings used in prior work. A key feature of RE2-Bench is categorizing each reasoning problem as Easy or Hard via a principled majority-vote mechanism over nine interpretable code complexity metrics, resulting in two well-separated and semantically meaningful difficulty categories suitable for precise calibration of LLM reasoning ability. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks, input prediction and output prediction, using RE2-Bench reveals a significant performance drop from Easy to Hard problems (51.50% for input prediction and 42.15% for output prediction), confirming that prior evaluations substantially overestimate the reasoning capabilities of LLMs.
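
To make the idea of serializing and deserializing custom types concrete, the following is a minimal sketch of one possible approach, not the RE2-Bench implementation. It assumes objects keep their state in instance attributes and that the caller supplies a class registry; the names `serialize`, `deserialize`, `Point`, and `Segment` and the `__type__`/`attrs` keys are hypothetical and chosen only for illustration.

```python
import json
from typing import Any, Dict


def serialize(obj: Any) -> Any:
    """Recursively turn a value into JSON-compatible data (illustrative only)."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, (list, tuple)):
        # Assumption: tuples are flattened to lists, so the distinction is lost.
        return [serialize(item) for item in obj]
    if isinstance(obj, dict):
        return {str(key): serialize(value) for key, value in obj.items()}
    # Assumption: any other value is a custom object with a __dict__.
    # Objects using __slots__, sets, and similar types are not handled here.
    return {
        "__type__": type(obj).__name__,
        "attrs": {name: serialize(value) for name, value in vars(obj).items()},
    }


def deserialize(data: Any, registry: Dict[str, type]) -> Any:
    """Rebuild values produced by serialize(), using a name-to-class registry."""
    if isinstance(data, list):
        return [deserialize(item, registry) for item in data]
    if isinstance(data, dict):
        if "__type__" in data:
            cls = registry[data["__type__"]]
            instance = cls.__new__(cls)  # bypass __init__, restore attributes directly
            for name, value in data["attrs"].items():
                setattr(instance, name, deserialize(value, registry))
            return instance
        return {key: deserialize(value, registry) for key, value in data.items()}
    return data


class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


class Segment:
    def __init__(self, start: Point, end: Point, tags: list) -> None:
        self.start = start
        self.end = end
        self.tags = tags


segment = Segment(Point(0, 1), Point(2, 3), ["a"])
text = json.dumps(serialize(segment))
restored = deserialize(json.loads(text), {"Point": Point, "Segment": Segment})
```

A textual form like `text` is what would let a model be asked to predict a non-primitive input or output; the round trip through `deserialize` is what would let a predicted value be checked against the real program.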

Paper Structure

This paper contains 26 sections, 1 equation, 19 figures, 6 tables, 2 algorithms.

Figures (19)

  • Figure 1: Complexity distribution of reasoning problems from prior techniques, compared to real-world programs. The metric values for sub-figures (a)--(d) range from $0$ to $20$; those for sub-figure (e) range from $0$ to $40$.
  • Figure 2: The reasoning problem (highlighted methods) from SWE-bench
  • Figure 3: Example of custom types in real-world reasoning problems and RE2-Bench serialization
  • Figure 4: Breakdown of the RE2-Bench reasoning problems across difficulty levels and difficulty metrics for a cut-off threshold of $25\%$. The outer and inner donuts indicate the problems ($p_j$s) with their $l_i^j$ labeled as L (Low) and H (High), respectively.
  • Figure 5: RE2-Bench problems overview
  • ...and 14 more figures
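
The Figure 4 caption refers to per-metric labels $l_i^j$ of L (Low) or H (High) under a $25\%$ cut-off, and the abstract describes a majority vote over nine complexity metrics. The sketch below shows one way such a scheme could be wired together. It is an assumption-laden illustration, not the paper's algorithm: the rank-based cut-off, the strict-majority rule, and the names `label_metric` and `categorize` are all hypothetical, and the example uses three metrics instead of nine to stay short.

```python
from typing import Dict, List, Optional


def label_metric(values: List[float], cutoff: float = 0.25) -> List[Optional[str]]:
    """Label the lowest `cutoff` fraction 'L' and the highest 'H'; others get None.

    Assumption: the cut-off is applied by rank, with ties broken by input order.
    """
    count = len(values)
    k = max(1, int(count * cutoff))
    order = sorted(range(count), key=lambda index: values[index])
    labels: List[Optional[str]] = [None] * count
    for index in order[:k]:
        labels[index] = "L"
    for index in order[-k:]:
        labels[index] = "H"
    return labels


def categorize(
    problems: Dict[str, List[float]], cutoff: float = 0.25
) -> Dict[str, Optional[str]]:
    """Assign 'Easy' or 'Hard' when a strict majority of metrics agree, else None."""
    names = list(problems)
    num_metrics = len(next(iter(problems.values())))
    votes: Dict[str, List[Optional[str]]] = {name: [] for name in names}
    for metric in range(num_metrics):
        column = [problems[name][metric] for name in names]
        for name, label in zip(names, label_metric(column, cutoff)):
            votes[name].append(label)
    result: Dict[str, Optional[str]] = {}
    for name, labels in votes.items():
        if labels.count("L") > num_metrics / 2:
            result[name] = "Easy"
        elif labels.count("H") > num_metrics / 2:
            result[name] = "Hard"
        else:
            result[name] = None  # no majority: left uncategorized in this sketch
    return result


# Made-up metric values for four hypothetical problems, three metrics each.
example = {
    "p1": [1, 1, 1],
    "p2": [2, 2, 2],
    "p3": [3, 3, 3],
    "p4": [9, 9, 9],
}
categories = categorize(example)
```

Under these assumptions, problems that fall between the two cut-offs on most metrics receive no category, which is one way a voting scheme could keep the Easy and Hard groups well separated.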