Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm

Dadi Guo, Tianyi Zhou, Dongrui Liu, Chen Qian, Qihan Ren, Shuai Shao, Zhiyuan Fan, Yi R. Fung, Kun Wang, Linfeng Zhang, Jing Shao

TL;DR

This paper addresses the rapid saturation of agent benchmarks by introducing TRACE, a self-evolving benchmark framework that generates harder tasks via test-time exploration while recording complete, validatable execution trajectories. It partitions the evolution process into three stages (proposal mining, exploration, and validation) and deploys three cooperative agents (Evolutionary Proposer, Exploration Executor, Trajectory Validator) to produce pairs of evolved tasks and their traces. Empirical results on GAIA (and AIME-2024) show that evolved tasks reliably reduce model performance, indicating genuine increases in difficulty and diversity, including domain shifts beyond simple difficulty scaling. TRACE thus enables a scalable, auditable, and reproducible evolution of benchmarks, supporting sustained progress in evaluating autonomous agent capabilities across diverse domains.

Abstract

Recent advances in large language models (LLMs) and agent system designs have empowered agents with unprecedented levels of capability. However, existing agent benchmarks are rapidly being saturated by newly developed agents, making it difficult to meet the demands for evaluating agent abilities. To address this problem, we propose the Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. This framework takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new task with higher difficulty while recording validatable agent trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which provides task evolution proposals through preliminary exploration and divergent thinking; (2) problem formation and free exploration, where proposals are conceptualized into feasible problem candidates and the agents then explore them freely while recording their execution trajectories; and (3) multi-level validation, which ensures that the evolved tasks are accompanied by validatable and reproducible trajectories. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness through validatable execution trajectories. In addition, our framework can successfully adapt to and improve reasoning datasets represented by AIME-2024. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging runway for agent development.
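The three stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Step`, `EvolvedTask`, `TOOLS`, `propose`, `explore`, `validate`, `evolve`) is hypothetical, the LLM agents are replaced by deterministic stand-ins, and the "tools" are simple arithmetic functions chosen only so that replay is reproducible. The point it tries to show is the validate-by-reproduce idea: an evolved task is kept only if its recorded trajectory can be re-executed and yields the same observations and final answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One recorded Thought -> Action -> Observation step (hypothetical schema)."""
    thought: str
    action: str       # key into TOOLS
    argument: int
    observation: int


@dataclass
class EvolvedTask:
    question: str
    answer: int
    trajectory: List[Step] = field(default_factory=list)


# Toy deterministic tools standing in for real agent actions (assumption).
TOOLS: Dict[str, Callable[[int], int]] = {
    "square": lambda x: x * x,
    "add_seven": lambda x: x + 7,
}


def propose(original_question: str) -> str:
    """Stage 1 stand-in: turn the original task into a harder proposal."""
    return f"{original_question} Then square the result and add 7."


def explore(proposal: str, start: int) -> EvolvedTask:
    """Stage 2 stand-in: execute actions and record every step taken."""
    trajectory: List[Step] = []
    value = start
    for action in ("square", "add_seven"):
        observation = TOOLS[action](value)
        trajectory.append(Step(f"apply {action}", action, value, observation))
        value = observation
    return EvolvedTask(question=proposal, answer=value, trajectory=trajectory)


def validate(task: EvolvedTask) -> bool:
    """Stage 3 stand-in: schema check, then replay each recorded step."""
    if not task.trajectory:
        return False
    previous: Optional[int] = None
    for step in task.trajectory:
        if step.action not in TOOLS:                 # schema check
            return False
        if previous is not None and step.argument != previous:
            return False                             # steps must chain together
        if TOOLS[step.action](step.argument) != step.observation:
            return False                             # replay must reproduce the observation
        previous = step.observation
    return previous == task.answer                   # final answer must follow from the trace


def evolve(original_question: str, start: int) -> Optional[EvolvedTask]:
    """Run all three stages; keep the task only if its trajectory reproduces."""
    candidate = explore(propose(original_question), start)
    return candidate if validate(candidate) else None
```

For example, `evolve("Compute 1 + 2.", 3)` returns a task whose two-step trajectory replays cleanly, whereas editing any recorded observation afterwards makes `validate` reject it. In a real system the replay step would re-run tool calls or code rather than pure functions, which is where reproducibility becomes the hard part.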

Paper Structure

This paper contains 33 sections, 4 equations, 17 figures, 4 tables, 2 algorithms.

Figures (17)

  • Figure 1: Model performance comparison on the Pass@1 metric across four distinct difficulty levels and evolution rounds under the TRACE framework. As the number of evolution rounds increases, the performance of models shows a downward trend, demonstrating that our framework successfully evolves more challenging tasks.
  • Figure 2: TRACE evolution pipeline. Starting from a GAIA Original Task, the Evolution Proposer conducts bottleneck analysis and pre-exploration, drafting a concrete proposal to increase difficulty. Crucially, the Evolution Executor constructs the evolved problem from its own trajectory: as it runs ReAct (Thought$\rightarrow$Action$\rightarrow$Observation), it collects evidence (numbers, constraints, citations, etc.) and uses this trajectory to parameterize and scaffold the new task, while simultaneously producing a complete solution trace. A Multi-Level Validator then applies lightweight schema checks, dynamic replay for reproducibility, and solvability/logic audits to ensure trace validity. The result is an Evolved Task that preserves the original benchmark’s interface yet requires deeper reasoning (math + coding), achieving a systematic benchmark-level difficulty increase.
  • Figure 3: The system prompt of our Evolution Proposer agent.
  • Figure 4: The bottleneck demonstrations in the system prompt of our Evolution Proposer agent.
  • Figure 5: The system prompt of our Exploration Executor agent.
  • ...and 12 more figures