SWE-Bench+: Enhanced Coding Benchmark for LLMs

Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, Song Wang

TL;DR

The paper scrutinizes SWE-bench, a real-world coding benchmark for LLMs, uncovering pervasive solution leakage and weak tests that inflate reported patch-resolution rates. It introduces SWE-Bench+ by filtering leakage and aligning data with post-cutoff knowledge, demonstrating a substantial drop in reported resolutions and revealing persistent weaknesses in test adequacy. Through patch-validation analyses, the work highlights the need for robust test suites and bias-free datasets to accurately gauge LLM-driven program repair capabilities. It also provides a cost-aware framework for evaluating practical deployment trade-offs among competing approaches. Overall, SWE-Bench+ offers a more trustworthy foundation for assessing true LLM effectiveness in software maintenance tasks and points to future work on strengthening tests and data curation.

Abstract

Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWE-Agent+GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating, as the solutions were directly provided in the issue report or the comments. We refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before the LLMs' knowledge cutoff dates, posing potential data leakage issues.
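To make the filtering arithmetic in the abstract concrete, here is a minimal Python sketch of how a resolution rate might be recomputed once flagged patches are excluded. The `Outcome` record, its flag names, and the toy instances are hypothetical and are not the authors' tooling or data. The sketch also assumes that both reported rates use the full 2,294-instance dataset as the denominator; under that assumption, 12.47% and 3.97% correspond to roughly 286 and 91 resolved instances.

```python
from dataclasses import dataclass

TOTAL_INSTANCES = 2294  # size of SWE-bench, as stated in the abstract


@dataclass
class Outcome:
    """Hypothetical record for one evaluated instance (illustrative only)."""
    instance_id: str
    passed_tests: bool
    solution_leak: bool = False  # fix was given in the issue report or comments
    weak_tests: bool = False     # tests pass but do not verify correctness


def resolution_rate(outcomes, total, strict=False):
    """Return the percentage of `total` instances counted as resolved.

    With strict=True, passing patches flagged for solution leakage or
    weak tests are not counted; the denominator stays the same.
    """
    resolved = 0
    for o in outcomes:
        if not o.passed_tests:
            continue
        if strict and (o.solution_leak or o.weak_tests):
            continue
        resolved += 1
    return 100.0 * resolved / total


def implied_count(rate_percent, total=TOTAL_INSTANCES):
    """Approximate instance count behind a reported percentage."""
    return round(rate_percent / 100.0 * total)


# Toy data: four made-up instances, three of which pass the tests.
toy = [
    Outcome("toy-1", True),
    Outcome("toy-2", True, solution_leak=True),
    Outcome("toy-3", True, weak_tests=True),
    Outcome("toy-4", False),
]

print(resolution_rate(toy, total=4))               # 75.0
print(resolution_rate(toy, total=4, strict=True))  # 25.0
print(implied_count(12.47), implied_count(3.97))   # 286 91
```

Note that the two percentages quoted for leakage (32.67%) and weak tests (31.08%) do not by themselves account for the full drop to 3.97%, so the paper's filtering presumably excludes further categories of problematic patches beyond the two modeled here.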

Paper Structure

This paper contains 14 sections, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Comparison of performance metrics and patterns across SWE-bench datasets
  • Figure 2: Overview of robustness analysis for SWE-Bench datasets
  • Figure 3: Solution Leakage in issue report for sympy-16669
  • Figure 4: Incorrect fix generated by the model for django-32517
  • Figure 5: Different files changed by model for issue-26093 of Matplotlib
  • ...and 2 more figures