Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning

Rem Yang, Julian Dai, Nikos Vasilakis, Martin Rinard

TL;DR

The paper asks whether large language models generalize in code reasoning or primarily reproduce patterns memorized from training data. It introduces two methods for generating out-of-distribution programs, domain-specific language sampling and program mutation, and evaluates 10 models on three datasets (LLM-List, DSL-List, LeetCode) using execution prediction and execution choice tasks. The results show a progression over time: earlier models behave in a way consistent with pattern matching, while the latest reasoning models generalize more robustly across original and mutated programs, including DSL-sampled and LeetCode programs. The work provides a framework for measuring generalization in code reasoning and argues that mutation-based benchmarks are needed in addition to data-cutoff strategies for robust AI-assisted software engineering.

Abstract

We assess how the code reasoning abilities of large language models (LLMs) generalize to different kinds of programs. We present techniques for obtaining in- and out-of-distribution programs with different characteristics: code sampled from a domain-specific language, code automatically generated by an LLM, code collected from competitive programming contests, and mutated versions of these programs. We also present an experimental methodology for evaluating LLM generalization by comparing their performance on these programs. We perform an extensive evaluation across 10 state-of-the-art models from the past year, obtaining insights into their generalization capabilities over time and across different classes of programs. Our results highlight that while earlier models exhibit behavior consistent with pattern matching, the latest models exhibit strong generalization abilities on code reasoning.

Paper Structure

This paper contains 43 sections, 7 figures, 6 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Example programs from the LLM-List (left), DSL-List (center), and LeetCode (right) datasets.
  • Figure 2: Domain-specific language for the DSL-List dataset. L denotes List, and t0 and t1 are polymorphic types. We allow integers in the range $[-1, 5]$.
  • Figure 3: Overview of the execution prediction experiment. Given an original program $P$, its mutated version $P'$, and an input $x$, we instruct an LLM to predict the outputs of the original program $P(x)$ and the mutated program $P'(x)$. For the original program, we check if the model prediction correctly matches $P(x)$ (OC) or incorrectly matches $P'(x)$ (OR). For the mutated program, we check if the model prediction correctly matches $P'(x)$ (MC) or incorrectly matches $P(x)$ (MR).
  • Figure 4: Execution prediction results on list datasets as a function of lines of code.
  • Figure 5: Execution prediction results on LeetCode as a function of lines of code.
  • ...and 2 more figures
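
The four outcome labels in the Figure 3 caption can be illustrated with a minimal Python sketch. The function name `classify_predictions`, the toy programs `P` and `P_mut`, and the hard-coded model predictions are all hypothetical and are not taken from the paper. The sketch also assumes that $P(x) \neq P'(x)$ for the chosen input, since otherwise the "correct" and "other program" labels could not be told apart.

```python
def classify_predictions(original_output, mutated_output,
                         pred_original, pred_mutated):
    """Label one (P, P', x) trial with the OC/OR/MC/MR flags from Figure 3.

    Assumption (not stated in the caption): the two programs disagree on x.
    """
    if original_output == mutated_output:
        raise ValueError("P(x) and P'(x) must differ for the labels to be meaningful")
    return {
        # Prediction for the original program matches P(x).
        "OC": pred_original == original_output,
        # Prediction for the original program matches P'(x) instead.
        "OR": pred_original == mutated_output,
        # Prediction for the mutated program matches P'(x).
        "MC": pred_mutated == mutated_output,
        # Prediction for the mutated program matches P(x) instead.
        "MR": pred_mutated == original_output,
    }


# Hypothetical toy programs: the mutation changes the constant 1 to 2.
def P(xs):
    return [x + 1 for x in xs]


def P_mut(xs):
    return [x + 2 for x in xs]


x = [0, 1, 2]

# Hypothetical model behavior: it answers as if both programs were P.
result = classify_predictions(
    original_output=P(x),
    mutated_output=P_mut(x),
    pred_original=[1, 2, 3],
    pred_mutated=[1, 2, 3],
)
```

In this toy trial the model is right on the original program (OC) but reproduces the original program's output on the mutated one (MR). A prediction that matches neither output sets no flag for that program.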