VeRO: An Evaluation Harness for Agents to Optimize Agents

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Sam Denton

TL;DR

This paper introduces VeRO (Versioning, Rewards, and Observations), which provides two things: a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces; and a benchmark suite of target agents and tasks with reference evaluation procedures.

Abstract

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VeRO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VeRO to support research on agent optimization as a core capability for coding agents.
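The edit-execute-evaluate cycle described in the abstract can be illustrated with a minimal sketch. This is not the VeRO implementation or its API: `Snapshot`, `evaluate`, `propose_edit`, and `optimize` are hypothetical names, the target agent is reduced to a single made-up `quality` parameter, and the stochastic LLM behaviour is replaced by seeded random noise. The sketch only shows how versioned snapshots, a fixed evaluation budget, and per-run traces could fit together.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One versioned state of the target agent (hypothetical structure)."""
    version: int
    config: dict
    score: float
    trace: list = field(default_factory=list)


def evaluate(config, rng):
    """Toy stand-in for a noisy agent evaluation; returns (score, trace)."""
    noise = rng.uniform(-0.05, 0.05)
    score = max(0.0, min(1.0, config["quality"] + noise))
    trace = [f"quality={config['quality']:.2f}", f"score={score:.3f}"]
    return score, trace


def propose_edit(config, rng):
    """Toy stand-in for the optimizer's edit step (assumed, not from the paper)."""
    edited = dict(config)
    edited["quality"] = config["quality"] + rng.uniform(-0.1, 0.15)
    return edited


def optimize(initial_config, eval_budget, seed=0):
    """Run edit-execute-evaluate cycles until the evaluation budget is spent."""
    rng = random.Random(seed)
    score, trace = evaluate(initial_config, rng)
    history = [Snapshot(0, initial_config, score, trace)]  # version 0 = baseline
    best = history[0]
    evals_used = 1
    while evals_used < eval_budget:
        candidate = propose_edit(best.config, rng)   # edit
        score, trace = evaluate(candidate, rng)      # execute + evaluate
        evals_used += 1
        snap = Snapshot(len(history), candidate, score, trace)
        history.append(snap)                         # every attempt is versioned
        if snap.score > best.score:
            best = snap
    return best, history


best, history = optimize({"quality": 0.5}, eval_budget=5)
print(f"best version: {best.version} of {len(history)} snapshots")
```

Keeping every snapshot, including rejected ones, is what makes a trajectory reproducible and comparable across optimizers in this sketch; a real harness would store code revisions rather than a dictionary.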

Paper Structure (43 sections, 3 equations, 14 figures, 13 tables, 1 algorithm)

Figures (14)

  • Figure 1: VeRO system architecture. Top (orange): example optimization trajectory. Bottom (green): system components. VeRO enforces versioning, reproducible execution, and controlled feedback, enabling systematic comparison of optimizers for agent optimization.
  • Figure 2: Case study results highlighting optimization outcomes across optimizer instruction types for FACTS, GAIA, and SimpleQA. Orange circles: Pawn agent; green triangles: Knight agent. Dashed lines: baseline accuracies.
  • Figure 3: Accuracy ratio (iteration accuracy/baseline) across instruction types, aggregated over all three evaluation datasets (FACTS, GAIA, SimpleQA) for Pawn and Knight agents.
  • Figure 4: Probability that the changes made between successive evaluations in a trajectory fall into each of several types (e.g., prompt, tool, or workflow changes), computed across all trajectories conducted using VeRO scaffolds. Bold numbers give the entropy of the probability distribution defined by each bar.
  • Figure 5: Optimizer performance across tasks. Each grouped bar shows the average lift (best score minus initial score) for each optimizer configuration. Red horizontal lines indicate the mean lift per task across all configurations.
  • ...and 9 more figures
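The captions above define three summary quantities: the accuracy ratio (Figure 3), the entropy of a change-type distribution (Figure 4), and the lift (Figure 5). A minimal sketch of how each could be computed follows. The function names are hypothetical, the entropy base is not stated in the caption (natural log is assumed here), and treating the initial score as eligible for "best score" is also an assumption.

```python
import math


def accuracy_ratio(iteration_accuracy, baseline_accuracy):
    """Figure 3: iteration accuracy divided by baseline accuracy."""
    return iteration_accuracy / baseline_accuracy


def lift(scores):
    """Figure 5: best score minus initial score.

    Assumes scores[0] is the initial score and that it counts toward the best,
    so the lift is never negative under this reading.
    """
    return max(scores) - scores[0]


def entropy(probabilities):
    """Figure 4: Shannon entropy of a distribution, in nats (assumed base e)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Illustrative inputs only; these are not results from the paper.
print(accuracy_ratio(0.6, 0.5))
print(lift([0.40, 0.50, 0.45]))
print(entropy([0.5, 0.5]))
```

An accuracy ratio above 1 means the iteration beat the baseline, and a higher entropy means the changes in a trajectory are spread more evenly across change types.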