The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam

TL;DR

The paper examines whether state-of-the-art LLMs genuinely solve coding tasks or merely recall training data that overlaps with benchmarks like SWE-Bench. It introduces three diagnostic tasks (file-path identification, function reproduction, and prefix completion) and pairs them with cross-benchmark analyses (SWE-Bench Verified, SWE-Bench C#, RefactorBench, and Outside-Repo tasks) to separate memorization from reasoning. Results point to substantial instance- and repository-level memorization: 5-gram overlap and verbatim code reproduction are high on SWE-Bench Verified and drop on external benchmarks, which suggests data contamination. The findings argue for contamination-resistant benchmarks, temporal controls, and cross-benchmark validation so that reported gains reflect transferable software engineering capability rather than memorization artifacts.

Abstract

As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks to probe models' underlying knowledge: file path identification from issue descriptions alone, and ground truth function reproduction given only the current file context and issue description. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. On tasks from repositories not included in SWE-Bench, this accuracy reaches only up to 53%, pointing to possible data contamination or memorization. Similar patterns are observed for the function reproduction task, where verbatim similarity is much higher on SWE-Bench Verified than on other similar coding benchmarks (up to 35% consecutive 5-gram accuracy on SWE-Bench Verified and Full, but only up to 18% for tasks in other benchmarks). These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
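
To make the two quantities in the abstract concrete, here is a minimal Python sketch of how file-path accuracy and a consecutive 5-gram overlap score could be computed. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the metrics precisely, so the whitespace tokenization, the use of distinct reference 5-grams as the denominator, and the exact-string path comparison are all assumptions, and the function names `ngrams`, `five_gram_overlap`, and `path_accuracy` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of consecutive n-token windows in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def five_gram_overlap(generated: str, reference: str, n: int = 5) -> float:
    """Fraction of the reference's distinct n-grams that also occur in `generated`.

    Assumption: tokens are whitespace-separated; the paper may tokenize differently.
    """
    ref = ngrams(reference.split(), n)
    if not ref:
        # Reference shorter than n tokens: no n-grams to match.
        return 0.0
    gen = ngrams(generated.split(), n)
    return len(ref & gen) / len(ref)


def path_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of instances whose predicted file path exactly matches the gold path.

    Assumption: one gold path per instance, compared as stripped strings.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p.strip() == g.strip() for p, g in zip(predicted, gold))
    return hits / len(gold)


if __name__ == "__main__":
    reference = "def f ( x ) : return x + 1"
    generated = "def f ( x ) : return x - 1"
    print(five_gram_overlap(generated, reference))
    print(path_accuracy(["a/b.py", "c/d.py"], ["a/b.py", "x/y.py"]))
```

Under this reading, a high overlap score on one benchmark alongside a much lower score on comparable tasks elsewhere is the kind of gap the abstract treats as a signal of possible memorization. A single score on its own does not separate recall from reasoning.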

Paper Structure

This paper contains 41 sections, 3 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of our benchmark memorization detection approach.
  • Figure 2: The prompt template used to test models' ability to identify buggy files without repo access.
  • Figure 3: Minimal reproduction of RidgeClassifierCV parameter issue.
  • Figure 4: Distribution of repositories in SWE-Bench C#.
  • Figure 5: Example issue description requesting creation of new ToDoItem instances.
  • ...and 6 more figures