The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

Guangrui Li; Yaochen Xie; Yi Liu; Ziwei Dong; Xingyuan Pan; Tianqi Zheng; Jason Choi; Michael J. Morais; Binit Jha; Shaunak Mishra; Bingrou Zhou; Chen Luo; Monica Xiao Cheng; Dawn Song

TL;DR

This paper proposes ProEvolve, a graph-based framework that makes environment evolution programmable. It expresses evolutionary dynamics as graph transformations to generate environments automatically, and it instantiates task sandboxes via subgraph sampling and programming.

Abstract

LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
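The idea of expressing capability changes as graph transformations can be made concrete with a small sketch. The following is not the ProEvolve implementation; the class `EnvGraph`, the operators `onboard_field` and `retire_node`, the relation label `"reads"`, and the node names are all hypothetical and chosen only for illustration. It assumes an environment is a set of typed nodes (schema fields and tools) joined by labeled edges, and that each edit returns a new graph version rather than mutating the old one.

```python
from dataclasses import dataclass, field


@dataclass
class EnvGraph:
    """Typed relational graph: typed nodes plus labeled, directed edges."""
    # node name -> node type, e.g. "schema", "tool", or "data" (assumed type names)
    nodes: dict = field(default_factory=dict)
    # (source, relation, target) triples
    edges: set = field(default_factory=set)

    def copy(self) -> "EnvGraph":
        return EnvGraph(dict(self.nodes), set(self.edges))


def onboard_field(g: EnvGraph, field_name: str, tool_name: str) -> EnvGraph:
    """Hypothetical edit operator: add a schema field and a tool that reads it."""
    out = g.copy()
    out.nodes[field_name] = "schema"
    out.nodes[tool_name] = "tool"
    out.edges.add((tool_name, "reads", field_name))
    return out


def retire_node(g: EnvGraph, name: str) -> EnvGraph:
    """Hypothetical edit operator: drop a node and every edge that touches it,
    so no tool is left pointing at a field that no longer exists."""
    out = g.copy()
    out.nodes.pop(name)
    out.edges = {e for e in out.edges if name not in (e[0], e[2])}
    return out


# Seed environment, then two successive versions produced by edits.
g0 = EnvGraph(
    nodes={"Order.order_id": "schema", "get_order": "tool"},
    edges={("get_order", "reads", "Order.order_id")},
)
g1 = onboard_field(g0, "Order.gift_note", "get_gift_note")
g2 = retire_node(g1, "get_order")
print(sorted(g2.nodes))
```

Returning a fresh graph from each operator keeps every version available, which is one simple way to obtain a sequence of environments from a single seed; the paper's actual operators and code generation step may differ.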

Paper Structure (35 sections, 4 equations, 5 figures, 2 tables)

Introduction
Related Work
From Static to Evolving: Challenges
ProEvolve: Programmable Evolution for Agent Benchmarks
Graph Formalism for Environment Modeling
Programming to Evolve Environment via Graphs
Programming Tasks as Subgraphs
State-Wise User Simulation and Evaluation
State instruction and success criterion.
Progression rule.
State success rate.
Experiments
Benchmark Generation: An E-Commerce Scenario
Experiment Setup
Results
...and 20 more sections

Figures (5)

Figure 1: End-to-end workflow of programmable environment evolution and graph-grounded task instantiation. Environment graphs are evolved via programmable graph edits and translated into executable code (left). Tasks are then generated by sampling subgraphs and materializing state-wise user intents and data into runnable sandbox instances (right), enabling controlled evaluation under evolving environments.
Figure 2: Programmable environment evolution via graph transformations. Starting from a seed environment graph $\mathcal{G}^0$, we generate a curriculum of environments $\mathcal{G}^1,\mathcal{G}^2,\mathcal{G}^3$ by applying explicit edit operators (arrows; e.g., component onboarding, schema/tool updates, and dependency rewiring), which add/remove nodes and edges in a coherent manner. This yields controlled environment dynamics while preserving a unified representation for task generation and evaluation across versions.
Figure 3: Context as subgraph expansion in a tool-mediated conversation. At each turn, the environment exposes a reachable context subgraph (gray nodes; dashed arrows for reachable tool transitions), while the agent activates a subset of nodes (green) by executing tools/actions (solid arrows) conditioned on the dialogue. As the conversation progresses, executed transitions expand the active subgraph, enabling retrieval and integration of newly reachable information (e.g., from User.user_id to User.order to Order.order_items and downstream product attributes).
Figure 4: Performance-efficiency trade-off of replay strategies. Each point corresponds to a model, plotted by average tool calls (x-axis) and success rate / mean completeness (y-axis) over the evolving episode. Faint markers denote the Baseline strategy, while bold markers denote the replay strategy; arrows connect the same model before and after replay. Top: Baseline $\rightarrow$ History Replay. Bottom: Baseline $\rightarrow$ Reflection Replay.
Figure 5: Efficiency breakdown by task difficulty. We report average tool calls, estimated cost, conversation turns, and reward for each model on easy vs. hard tasks. Harder tasks generally require longer trajectories and more tool usage.
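The subgraph expansion described in the Figure 3 caption can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's code: the node names come from the caption, while the tool names `get_user_orders` and `get_order_items`, the `TRANSITIONS` table, and both helper functions are hypothetical. It assumes each tool needs one already-active node and reveals one new node.

```python
# Hypothetical tool transitions: tool -> (node it requires, node it reveals).
TRANSITIONS = {
    "get_user_orders": ("User.user_id", "User.order"),
    "get_order_items": ("User.order", "Order.order_items"),
}


def reachable_tools(active: frozenset) -> list:
    """Tools whose required node is active and whose result is not yet known."""
    return sorted(
        tool
        for tool, (src, dst) in TRANSITIONS.items()
        if src in active and dst not in active
    )


def execute(active: frozenset, tool: str) -> frozenset:
    """Run a tool and return the expanded active subgraph."""
    src, dst = TRANSITIONS[tool]
    if src not in active:
        raise ValueError(f"{tool} needs {src}, which is not active yet")
    return active | {dst}


# The conversation starts with only the user id known.
turn0 = frozenset({"User.user_id"})
turn1 = execute(turn0, "get_user_orders")
turn2 = execute(turn1, "get_order_items")
print(reachable_tools(turn0), reachable_tools(turn1), reachable_tools(turn2))
```

In this toy model the set of reachable tools changes only when an executed transition activates a new node, which mirrors the caption's distinction between reachable (dashed) and executed (solid) transitions.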
