ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Ayush Nangia; Shikhar Mishra; Aman Gokrani; Paras Chopra

Abstract

We introduce ISO-Bench, a benchmark that tests coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task gives an agent a codebase and a bottleneck description; the agent must produce an optimization patch, which is evaluated against expert human solutions. We curated 54 tasks from merged pull requests with measurable performance improvements. Existing benchmarks rely heavily on runtime-based metrics, but such metrics can be gamed: a patch can pass tests without capturing the actual intent of the code change. We therefore combine hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for complete evaluation. Evaluating both closed and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to execute a working solution. We also show that agents with identical underlying models differ substantially, suggesting that scaffolding is as important as the model.

Paper Structure (58 sections, 2 equations, 15 figures, 7 tables)

Introduction
Related Work
Correctness-driven benchmarks:
Efficiency-driven benchmarks:
Coding Agent Architectures:
LLM-as-a-Judge for Code Evaluation:
ISO-Bench
Task Formulation
Benchmark Construction
Stage 1: Commit Extraction:
Stage 2: Manual Curation:
Stage 3: PR Analysis:
Evaluation Metrics
Hard Metrics
Soft Metrics
...and 43 more sections

Figures (15)

Figure 1: ISO-Bench evaluation pipeline. Given a codebase and task description, a coding agent produces an optimization patch. We compare this patch against the human commit using hard metrics (TTFT, throughput) and soft metrics (bottleneck targeting, implementation approach). Hard metrics measure performance improvement; soft metrics assess whether the agent targeted the correct code.
Figure 2: Quadrant framework for evaluating optimization attempts. The horizontal axis shows performance (good: beats or similar; bad: worse or failed). The vertical axis shows whether the agent targeted the correct bottleneck (correct: same or related target; wrong: different target or no optimization). Q1 True Success: correct target, good performance. Q2 Good Intent: correct target, bad performance. Q3 Lucky Win: wrong target, good performance. Q4 Total Failure: wrong target, bad performance.
Figure 3: Good Intent vs Bad Execution on vLLM (39 tasks). Light bars show correct target identification (Q1+Q2). Dark bars show True Success (Q1). The gap represents Q2 failures.
Figure 4: Good Intent vs Bad Execution on SGLang (15 tasks). Light bars show correct target identification (Q1+Q2). Dark bars show True Success (Q1). The gap represents Q2 failures.
Figure 5: Approach distribution on vLLM (39 tasks).
...and 10 more figures
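
The quadrant framework in the Figure 2 caption can be sketched as a small classifier. This is only an illustrative sketch: the function names (`classify`, `summarize`) and the string labels for the performance and target outcomes are assumptions chosen to mirror the caption, not code from the benchmark. The `summarize` helper shows how the gap described in the Figure 3 and Figure 4 captions (correct target identification minus True Success) reduces to the Q2 count.

```python
from collections import Counter

# Labels below are hypothetical; they mirror the wording of the Figure 2 caption.
GOOD_PERFORMANCE = {"beats", "similar"}
BAD_PERFORMANCE = {"worse", "failed"}
CORRECT_TARGET = {"same", "related"}
WRONG_TARGET = {"different", "none"}  # "none" stands for "no optimization"

Q1, Q2, Q3, Q4 = "Q1 True Success", "Q2 Good Intent", "Q3 Lucky Win", "Q4 Total Failure"


def classify(performance: str, target: str) -> str:
    """Map one optimization attempt to a quadrant."""
    if performance not in GOOD_PERFORMANCE | BAD_PERFORMANCE:
        raise ValueError(f"unknown performance label: {performance!r}")
    if target not in CORRECT_TARGET | WRONG_TARGET:
        raise ValueError(f"unknown target label: {target!r}")
    good = performance in GOOD_PERFORMANCE
    correct = target in CORRECT_TARGET
    if correct and good:
        return Q1
    if correct:
        return Q2
    if good:
        return Q3
    return Q4


def summarize(attempts):
    """Count quadrants over (performance, target) pairs and report the Q2 gap."""
    counts = Counter(classify(p, t) for p, t in attempts)
    correct_target = counts[Q1] + counts[Q2]  # light bars in Figures 3 and 4
    true_success = counts[Q1]                 # dark bars in Figures 3 and 4
    return {
        "correct_target": correct_target,
        "true_success": true_success,
        "gap": correct_target - true_success,  # equals the number of Q2 attempts
    }
```

For example, `summarize([("beats", "same"), ("failed", "related"), ("similar", "different")])` would count one attempt each in Q1, Q2, and Q3, so two attempts identify the correct target and the gap is one.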
