Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study

You Wang, Michael Pradel, Zhongxin Liu

TL;DR

This study investigates the reliability of patches labeled correct by SWE-bench Verified and finds that weak test suites lead to substantial overestimation of correctness. It introduces PatchDiff, a differential patch-testing technique that uses LLM-generated differentiating tests to reveal behavioral discrepancies between plausible patches and oracle patches. On average, 7.8% of patches counted as correct fail when validated against all developer tests, and 29.6% of plausible patches diverge behaviorally from the ground truth; the study extrapolates an incorrectness rate of 11.0% and a 6.4-point inflation of resolution rates. The work advocates robust patch validation, integration of differentiating tests into benchmarks, and better-specified issue statements to enable sustainable, trustworthy evaluation of automated issue-solving tools.

Abstract

Automated issue solving aims to resolve real-world issues in software repositories. The most popular benchmarks for automated issue solving are SWE-bench and its human-filtered subset SWE-bench Verified. These benchmarks leverage testing to validate generated patches. However, because testing is rarely exhaustive, a patch may pass the tests but nevertheless fail to match the developers' expectations. Unfortunately, it is currently unclear to what extent evaluations performed with SWE-bench suffer from such plausible but incorrect patches. This paper presents an in-depth empirical study of the correctness of plausible patches generated by three state-of-the-art issue-solving tools evaluated on SWE-bench Verified. We extensively test and inspect generated patches, and compare them against human-written ground truth patches. The core of our methodology is a novel technique, PatchDiff, for differential patch testing, which automatically exposes behavioral discrepancies between two patches. Our findings reveal critical weaknesses in SWE-bench's patch validation mechanism, which cause 7.8% of all patches to count as correct while failing the developer-written test suite. Moreover, our novel automated technique reveals that even more (29.6%) plausible patches induce different behavior than the ground truth patches. These behavioral differences are often due to similar but divergent implementations (46.8%) and due to generated patches that adapt more behavior than the ground truth patches (27.3%). Our manual inspection shows that 28.6% of behaviorally divergent patches are certainly incorrect. Combined, the different weaknesses lead to an inflation of reported resolution rates by 6.2 absolute percentage points. Our findings are a call to arms for more robust and reliable evaluation of issue-solving tools. We envision our automated differential patch testing technique to be useful for this purpose.
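
The idea of differential patch testing can be illustrated with a minimal Python sketch. This is not the PatchDiff implementation: the abstract describes a technique that exposes discrepancies automatically, whereas the sketch below uses a hand-written input list, and the names `observe`, `differentiating_inputs`, `oracle_parse`, and `candidate_parse` are hypothetical stand-ins for code under an oracle patch and a generated patch. It only shows the core comparison: run the same input against both versions and report inputs whose observable outcome (return value or raised exception type) differs.

```python
from typing import Any, Callable, Iterable, List, Tuple

Outcome = Tuple[str, Any]


def observe(func: Callable[[Any], Any], arg: Any) -> Outcome:
    """Run func(arg) and summarize what a test could observe."""
    try:
        return ("returned", func(arg))
    except Exception as exc:  # compare exception types, not messages
        return ("raised", type(exc).__name__)


def differentiating_inputs(
    oracle: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Outcome, Outcome]]:
    """Return every input on which the two versions behave differently."""
    diffs = []
    for arg in inputs:
        expected = observe(oracle, arg)
        actual = observe(candidate, arg)
        if expected != actual:
            diffs.append((arg, expected, actual))
    return diffs


# Hypothetical behavior under the oracle patch: negative values are rejected.
def oracle_parse(text: str) -> int:
    value = int(text)
    if value < 0:
        raise ValueError("negative values are not allowed")
    return value


# Hypothetical behavior under a plausible generated patch: it agrees on the
# inputs a weak test suite might cover, but silently accepts negatives.
def candidate_parse(text: str) -> int:
    return abs(int(text))


divergences = differentiating_inputs(
    oracle_parse, candidate_parse, ["3", "0", "abc", "-5"]
)
for arg, expected, actual in divergences:
    print(f"input={arg!r}: oracle {expected}, candidate {actual}")
```

In this toy setup only the input `"-5"` separates the two versions, which mirrors the situation the study describes: a patch can agree with the oracle on every tested input and still diverge elsewhere. A divergence alone does not prove the candidate is wrong, which is why the study pairs automated testing with manual inspection.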

Paper Structure

This paper contains 30 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: An example of plausible but incorrect patches from the issue sympy-22714, where the exception is correctly raised only under the oracle patch
  • Figure 2: Example of aligned sem-changes (sympy-23262)
  • Figure 3: Three examples of patch difference patterns