AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press

TL;DR

AlgoTune tackles the problem of whether language models can go beyond reproducing known code to autonomously produce highly optimized numerical algorithms. The authors introduce a 154-task benchmark spanning math, physics, and CS, with a robust verification and timing framework, and an LM-driven agent, AlgoTuner, that iteratively refines code to achieve speedups. Findings show average improvements of $1.72\times$ over reference solvers, but optimizations are predominantly surface-level rather than novel algorithms, highlighting both the potential and current limits of LM-assisted code optimization. This work provides a path toward integrating LM-driven optimization into widely used libraries, with implications for accelerating numerical computing in practice.

Abstract

Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: we task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it, profiles performance, verifies correctness on tests, and selects the fastest valid version. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, scikit-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
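
The "verify, time, and keep the fastest valid version" step described in the abstract can be illustrated with a minimal sketch. This is not the AlgoTune harness: the helper names (`best_time`, `select_fastest_valid`), the toy sum-of-squares task, and the choice of minimum-over-repeats timing are all assumptions made for illustration.

```python
import time

def best_time(fn, arg, repeats=5):
    """Minimum wall-clock time of fn(arg) over several runs (assumed timing rule)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best

def select_fastest_valid(reference, candidates, inputs, is_valid):
    """Return (name, speedup) of the fastest candidate that passes verification.

    A candidate that raises or fails is_valid on any input is discarded.
    Falls back to ("reference", 1.0) when nothing valid is faster.
    """
    ref_time = sum(best_time(reference, x) for x in inputs)
    best_name, best_speedup = "reference", 1.0
    for name, fn in candidates.items():
        try:
            ok = all(is_valid(x, fn(x)) for x in inputs)
        except Exception:
            ok = False
        if not ok:
            continue
        cand_time = max(sum(best_time(fn, x) for x in inputs), 1e-12)
        speedup = ref_time / cand_time
        if speedup > best_speedup:
            best_name, best_speedup = name, speedup
    return best_name, best_speedup

# Hypothetical toy task: sum of squares 0^2 + 1^2 + ... + (n-1)^2.
def slow_sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

candidates = {
    "closed_form": lambda n: (n - 1) * n * (2 * n - 1) // 6,  # correct and fast
    "always_zero": lambda n: 0,                               # fast but invalid
}
inputs = [100_000, 200_000]
is_valid = lambda n, out: out == slow_sum_squares(n)

name, speedup = select_fastest_valid(slow_sum_squares, candidates, inputs, is_valid)
print(name, f"{speedup:.1f}x")
```

The invalid candidate is rejected before it is timed, so a fast but wrong solution can never be selected.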

Paper Structure

This paper contains 41 sections, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: AlgoTune challenges LMs to optimize 154 numerical functions, including QR Decomposition, gzip Compression and PageRank. We score LMs based on how much faster their generated code is than reference solvers. For concrete examples, see the qualitative analysis section.
  • Figure 2: The task collection pipeline for AlgoTune. We define an input generation method and solver for each task, along with a solution verifier. Automatic tests are then executed to check the validity of the task's implementation.
  • Figure 3: AlgoTune scores (on the development set of input problems) across all tasks, during the running of AlgoTuner, for intermediate budget splits, up to the total budget of $1.
  • Figure 4: Left: Our feedback controller task starts with a reference CVXPY implementation solving an SDP formulation. Right: AlgoTuner with o4-mini improves upon the runtime by a factor of 81 by rewriting it to use SciPy's discrete algebraic Riccati equation (DARE) solver.
  • Figure 5: Left: Our original code for a PSD cone projection of a symmetric matrix projects the eigenvalues to be non-negative. Right: AlgoTuner with Claude Opus 4 improves the code by a factor of 8 by 1) using a symmetric eigendecomposition, and 2) not forming the eigenvalue matrix and instead applying them directly to the eigenvectors.
  • ...and 1 more figure
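
The two changes named in the Figure 5 caption can be sketched in NumPy as follows. This is an illustrative reconstruction, not the code shown in the paper: the function names and the exact form of the baseline (a general `np.linalg.eig` call with an explicit diagonal matrix) are assumptions.

```python
import numpy as np

def project_psd_naive(A):
    """Assumed baseline: general eigendecomposition, explicit diagonal matrix."""
    w, V = np.linalg.eig(A)
    w = np.maximum(w.real, 0.0)          # clip negative eigenvalues to zero
    return (V @ np.diag(w) @ np.linalg.inv(V)).real

def project_psd_fast(A):
    """Symmetric eigendecomposition; eigenvalues applied directly to eigenvectors."""
    w, V = np.linalg.eigh(A)             # exploits symmetry, returns real output
    w = np.maximum(w, 0.0)
    return (V * w) @ V.T                 # V * w scales column j of V by w[j]

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                        # random symmetric test matrix

P_naive = project_psd_naive(A)
P_fast = project_psd_fast(A)
print(np.allclose(P_naive, P_fast, atol=1e-8))
```

Broadcasting `V * w` avoids building an n-by-n diagonal matrix and saves one dense matrix product, while `eigh` uses a solver specialized for symmetric input.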