Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue

Abstract

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, which limits their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task. This enables holistic evaluation of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website: https://formula-code.github.io

Paper Structure (65 sections, 10 equations, 27 figures, 14 tables)

Figures (27)

  • Figure 1: FormulaCode is a continuously updating benchmark for evaluating the holistic ability of agents to optimize large codebases. Each task in FormulaCode comprises a problem description of a performance regression from GitHub, an environment containing a baseline repository snapshot, and multiple expert-written, crowdsourced performance workloads, along with the tools to execute them. An agent's performance-improving edits are assessed on whether they outperform expert-written edits across multiple workloads while meeting multiple forms of correctness guarantees.
  • Figure 2: Overview of the FormulaCode construction pipeline. FormulaCode follows a four-stage pipeline to identify real-world performance optimization tasks. (1) Scrape compliant repositories (§ \ref{sec:dataset.construction.stage1}). (2) Apply rule-based and LLM-based filters to identify candidate performance improvement pull requests (§ \ref{sec:dataset.construction.stage2}). (3) Construct reproducible Docker environments for each candidate (§ \ref{sec:dataset.construction.stage3}). (4) Validate each candidate for correctness and statistically significant performance improvement (§ \ref{sec:dataset.construction.stage4}). The pipeline is fully automated and updates FormulaCode with new tasks every month.
  • Figure 3: Stratified advantage across hierarchy levels for each agent-model configuration. Each line traces the stratified advantage ($\texttt{Adv}_{\texttt{agent}}^{(\ell)}$) over $\ell \in \{1,2,3\}$, revealing whether a configuration prefers coarse module-level changes or fine-grained function-level edits.
  • Figure 4: Cost-performance tradeoff of agent-model configurations. As most agents struggle on code optimization tasks, the Pareto set is dominated primarily by the most expensive model (Claude 4.0 Sonnet).
  • Figure 5: Multi-workload tradeoff performance of agent-model configurations. We quantify a model's speedup performance as a function of its worst regression. The expert patch achieves the highest speedup while tolerating considerably high regressions on some workloads.
  • ...and 22 more figures
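
The Pareto set mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each agent-model configuration is summarized by one cost value (lower is better) and one performance score (higher is better). The `Config` type, the `pareto_set` helper, the configuration names, and all numbers below are hypothetical placeholders for illustration; they are not taken from the paper or its results.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """Hypothetical summary of one agent-model configuration."""
    name: str
    cost: float   # assumed: lower is better
    score: float  # assumed: higher is better


def pareto_set(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration `o` dominates `c` when it is at least as good on both
    axes (cost and score) and strictly better on at least one of them.
    """
    front = []
    for c in configs:
        dominated = any(
            o.cost <= c.cost
            and o.score >= c.score
            and (o.cost < c.cost or o.score > c.score)
            for o in configs
        )
        if not dominated:
            front.append(c)
    # Sort by cost so the front reads left to right on a cost axis.
    return sorted(front, key=lambda c: c.cost)


# Made-up placeholder values, not results from the benchmark.
configs = [
    Config("agent-a/model-x", cost=1.0, score=0.10),
    Config("agent-b/model-x", cost=2.0, score=0.05),
    Config("agent-a/model-y", cost=5.0, score=0.30),
    Config("agent-b/model-y", cost=6.0, score=0.30),
]

front_names = [c.name for c in pareto_set(configs)]
print(front_names)
```

With these placeholder values, `agent-b/model-x` is dominated by the cheaper and better `agent-a/model-x`, and `agent-b/model-y` is dominated by the cheaper `agent-a/model-y`, so only the two `agent-a` configurations remain on the front.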