SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan

TL;DR

SWE-fficiency addresses the gap in evaluating repo-level performance optimization by requiring language-model agents to investigate real codebases, localize runtime bottlenecks, and produce correctness-preserving patches that speed up real workloads. The benchmark assembles 498 tasks across 9 popular Python repositories using a principled data collection pipeline and a speedup-ratio evaluation against expert gold patches, with a strict separation of correctness and performance tests. Empirical results show substantial gaps between current LM agents and expert performance, including widespread mislocalization of bottlenecks and frequent correctness regressions, even as easier tasks yield modest gains. By releasing a scalable dataset, reproducible evaluation harness, and analysis of failure modes, SWE-fficiency aims to catalyze advances in long-horizon software reasoning and automated performance engineering.

Abstract

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.15x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
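As a rough illustration of the speedup-ratio idea described above, the sketch below compares an agent patch's speedup to the expert patch's speedup on the same workload. The function names, the rule that a patch failing the unit tests is scored as no speedup (1.0x), and the harmonic-mean aggregation are assumptions made for this example, not a statement of the benchmark's exact scoring rule.

```python
from statistics import harmonic_mean


def speedup(pre_runtime_s: float, post_runtime_s: float) -> float:
    """Speedup of a patch: workload runtime before the edit over runtime after."""
    return pre_runtime_s / post_runtime_s


def speedup_ratio(pre_s: float, agent_post_s: float, expert_post_s: float,
                  tests_pass: bool) -> float:
    """Agent speedup relative to expert speedup on one task.

    Assumption for this sketch: a patch that breaks the unit tests
    is scored as if it left the runtime unchanged (speedup of 1.0).
    """
    agent = speedup(pre_s, agent_post_s) if tests_pass else 1.0
    expert = speedup(pre_s, expert_post_s)
    return agent / expert


def aggregate(ratios: list[float]) -> float:
    """Aggregate per-task ratios (harmonic mean is an assumed choice here)."""
    return harmonic_mean(ratios)


if __name__ == "__main__":
    # Hypothetical task: 10 s workload, agent reaches 5 s, expert reaches 2.5 s.
    passing = speedup_ratio(10.0, 5.0, 2.5, tests_pass=True)
    failing = speedup_ratio(10.0, 5.0, 2.5, tests_pass=False)
    print(passing, failing, aggregate([passing, failing]))
```

A ratio of 1.0 would mean the agent matched the expert on that task; values below 1.0 mean the agent recovered only part of the expert's gain.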

Paper Structure

This paper contains 94 sections, 13 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: SWE-fficiency evaluates the investigative, pass-to-pass workflow of performance engineering: given an existing codebase state and a performance workload of interest, agents must edit the codebase to speed up that workload while keeping relevant repo unit tests green.
  • Figure 2: SWE-fficiency collects tasks through a multi-stage scraping pipeline: each stage prunes candidate tasks that introduce new behavior, are unlikely to be performance related, or are unsuitable for reproducible benchmarking. This yields a set of tasks, each of which has an accompanying expert (gold) patch. See the appendix on additional details of the data collection procedure for stage-specific details.
  • Figure 3: SWE-fficiency contains a diverse distribution over performance workload runtime (left); over gold patch speedup, i.e., the speedup achieved by the expert PR edit (middle); and over types of optimizations made by the expert (right). We use an LM to categorize the gold patch for each instance (for high-level analysis only) and manually verify a randomly chosen subset: see the appendix with the additional dataset summary.
  • Figure 4: LMs achieve strong performance on easier problems but struggle on tasks with longer workload runtime duration and larger baseline expert speedups. We bucket LM submissions by per-instance speedup ratio and compute the geometric mean per-bucket of (i) pre-edit workload runtime, (ii) the gold (expert) patch speedup, and (iii) the number of lines in the gold patch.
  • Figure 5: LMs find expert-level wins early in their action trajectories. When they underperform experts, LMs submit satisficing optimizations rather than pushing on toward expert parity.
  • ...and 13 more figures
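
The bucketed analysis described in the Figure 4 caption can be sketched as follows: group submissions by per-instance speedup ratio, then take the geometric mean of each quantity within a bucket. The bucket boundaries, record field names, and sample values here are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import geometric_mean

# Quantities summarized per bucket (field names are assumptions of this sketch).
FIELDS = ("pre_runtime_s", "gold_speedup", "gold_patch_lines")


def bucket_label(ratio: float) -> str:
    """Assign a speedup ratio to a bucket (hypothetical boundaries)."""
    if ratio < 0.5:
        return "<0.5x"
    if ratio < 1.0:
        return "0.5x-1x"
    return ">=1x"


def summarize(records: list[dict]) -> dict[str, dict[str, float]]:
    """Geometric mean of each field within each speedup-ratio bucket."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        grouped[bucket_label(rec["speedup_ratio"])].append(rec)
    return {
        label: {f: geometric_mean([r[f] for r in rows]) for f in FIELDS}
        for label, rows in grouped.items()
    }


if __name__ == "__main__":
    # Made-up records for demonstration only.
    demo = [
        {"speedup_ratio": 0.2, "pre_runtime_s": 1.0, "gold_speedup": 2.0, "gold_patch_lines": 10},
        {"speedup_ratio": 0.4, "pre_runtime_s": 4.0, "gold_speedup": 8.0, "gold_patch_lines": 40},
        {"speedup_ratio": 1.1, "pre_runtime_s": 0.5, "gold_speedup": 1.5, "gold_patch_lines": 5},
    ]
    for label, stats in summarize(demo).items():
        print(label, stats)
```

The geometric mean is a natural fit for runtimes and speedups because these quantities vary multiplicatively, so a few very slow workloads do not dominate a bucket's summary.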