GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica

TL;DR

GSO introduces a benchmark and automated pipeline to evaluate SWE-Agents on challenging high-performance software optimization tasks derived from real commits. It defines a machine-agnostic, human-targeted evaluation metric (Opt_p@K) and assesses multiple state-of-the-art models, revealing a substantial gap between current SWE-Agents and expert performance. The study combines quantitative results with a qualitative analysis of agent behavior to identify failure modes such as struggles with low-level code, reliance on lazy optimizations, and mislocalization of bottlenecks, while offering guidance for future improvements. Overall, GSO provides a rigorous, real-world testing ground for advancing reasoning and systems engineering capabilities in SWE-Agents.

Abstract

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests and analyzes repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is given a codebase and a performance test as a precise specification and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with limited improvement even under inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulty with low-level languages, reliance on lazy optimization strategies, and trouble accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
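
To make the evaluation criterion concrete, the following is a minimal sketch of how an Opt_p@K-style score could be computed. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `Rollout` and `Task` types, the field names, and the use of a simple runtime ratio (human runtime divided by agent runtime, both measured on the same machine) are all hypothetical. A task counts as solved when at least one of the first K rollouts is correct and reaches at least a fraction p of the expert commit's performance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    # Hypothetical record of one agent attempt on a task.
    correct: bool      # assumed flag: patch preserves behavior of the original code
    runtime_s: float   # performance-test runtime with the agent's patch applied


@dataclass
class Task:
    # Hypothetical record of one optimization task.
    human_runtime_s: float   # performance-test runtime with the expert commit applied
    rollouts: List[Rollout]


def solved(task: Task, p: float, k: int) -> bool:
    """Return True if any of the first k rollouts is correct and
    reaches at least a fraction p of the expert commit's performance."""
    for r in task.rollouts[:k]:
        # Assumption: "performance relative to the human commit" is the
        # ratio human_runtime / agent_runtime measured on the same machine.
        if r.correct and task.human_runtime_s / r.runtime_s >= p:
            return True
    return False


def opt_p_at_k(tasks: List[Task], p: float = 0.95, k: int = 1) -> float:
    """Fraction of tasks solved under threshold p with k rollouts."""
    return sum(solved(t, p, k) for t in tasks) / len(tasks)


# Toy data, invented purely for illustration.
tasks = [
    Task(1.0, [Rollout(True, 1.0)]),                        # matches the expert
    Task(1.0, [Rollout(True, 2.0), Rollout(True, 1.05)]),   # second try is close enough
    Task(1.0, [Rollout(False, 0.5)]),                       # fast but incorrect
    Task(2.0, [Rollout(True, 4.0)]),                        # correct but half as fast
]

print(opt_p_at_k(tasks, p=0.95, k=1))
print(opt_p_at_k(tasks, p=0.95, k=2))
```

Because the score compares the agent's patch with the expert commit on the same hardware, it depends on a ratio rather than on absolute runtimes, which is one way to read the "machine-agnostic" property mentioned in the TL;DR. Lowering p relaxes the target, and raising K gives the agent more attempts per task.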

Paper Structure

This paper contains 40 sections, 3 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: An example GSO task. We develop an automated pipeline that generates performance tests and analyzes repository commit history to identify real-world code optimization tasks. Each task consists of a codebase, performance tests, and the expert developer commit that serves as the performance target for the optimization problem. LLM-based SWE-Agents are then tasked with generating optimization patches using the performance test as a precise specification for the optimization problem. We evaluate the patches for both correctness and runtime efficiency, measuring whether they match or exceed the human expert optimization performance while ensuring equivalence.
  • Figure 2: Benchmark Feature Comparison and Performance Gap. Left: How GSO improves over existing benchmarks across key dimensions. Middle: Distribution of oracle LoC changes across benchmarks, showing that GSO solutions require 4-15x larger edits than those in existing benchmarks. Right: Performance comparison of O4-Mini across LCB (algorithmic puzzles), SWEBench-Verified (repository-level bug-fixes), and GSO, depicting the performance gap on optimization tasks.
  • Figure 3: Left: Popular optimization concepts and examples of algorithms used in ground-truth human commits for GSO tasks, highlighting the algorithmic complexity of the tasks. Right: Summary statistics for GSO tasks, the ground-truth human commits, and the performance tests, highlighting the repository-level nature of the tasks, which span diverse domains and languages.
  • Figure 4: $\text{Opt}@1$ performance. (a) Left: $\text{Opt}@1$ (speedup threshold $p$ set to 0.95) across models, with all models achieving less than 5% success. (b) Right: $\text{Opt}_{p}@1$, indicating the portion of problems where model patches match a fraction $p$ of the human commit's performance. We find that the strongest-performing models remain strong across all thresholds, with success rates dropping as matching human-level performance becomes more challenging.
  • Figure 5: Scaling test-time compute for O4-Mini and Claude-3.5-v2. (a) Left: $\text{Opt}@K$ performance as a function of inference steps (L) and parallel rollouts (K), showing parallel compute scales more efficiently than serial compute. (b) Right: $\text{Opt}@K$ performance with increasing rollouts, improving to 15% with diminishing returns beyond eight rollouts.
  • ...and 9 more figures