What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair

Matias Martinez, Xavier Franch

TL;DR

The paper investigates SWE-Bench leaderboards to understand who submits APR solutions, what products and LLMs are used, and how openness affects performance. Using a mixed-methods analysis of 212 entries across Lite and Verified, it shows strong industry participation, prevalent use of proprietary LLMs (notably Claude), and substantial open-source contributions, especially among academia. It reveals that Verified tends to yield higher precision than Lite and highlights risks like patch overfitting and data contamination that can inflate leaderboard scores. The findings offer guidance for more transparent benchmark practices, including richer metadata, shared artifacts, and mechanisms to assess patch correctness beyond test-passing, ultimately aiming to improve reproducibility and real-world generalization in APR research.

Abstract

The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems using real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to the Lite leaderboard and 133 to the Verified leaderboard. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions, which are typically open source, also remain competitive. We also find a clear dominance of proprietary LLMs, especially the Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.

Paper Structure (26 sections, 2 figures, 4 tables)