AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song

TL;DR

This work introduces AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. The pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotation.

Abstract

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are available at https://github.com/sunblaze-ucb/AgentSynth

Paper Structure

This paper contains 23 sections, 9 figures, and 20 tables.

Figures (9)

  • Figure 1: AgentSynth data generation pipeline. Given a persona, the task proposer generates an initial task, which is followed by a sequence of subtasks executed by the agent. Each step is verified; if execution fails, a revised subtask description is generated. After $n$ successful steps, a summarization agent composes final high-level tasks. Tasks at different difficulty levels are formed by summarizing the first $1$ to $n$ subtasks, enabling controllable task complexity.
  • Figure 2: Verifier calibration.
  • Figure 3: AgentSynth dataset statistics.
  • Figure 4: Model performance across task difficulty levels.
  • Figure 5: Model performance for bare LLMs and with Agent S3 scaffolding.
  • ...and 4 more figures
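
The generation loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `synthesize_tasks`, the five callables (`propose`, `execute`, `verify`, `revise`, `summarize`), and the `max_revisions` limit are all hypothetical stand-ins for the proposer, agent, verifier, reviser, and summarization agent. The caption does not say whether a revised subtask is re-executed or simply recorded; this sketch assumes it is re-executed.

```python
from typing import Callable, List


def synthesize_tasks(
    persona: str,
    n: int,
    propose: Callable[[str, List[str]], str],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    revise: Callable[[str, str], str],
    summarize: Callable[[List[str]], str],
    max_revisions: int = 3,  # hypothetical cap; not specified in the source
) -> List[str]:
    """Return one high-level task per difficulty level 1..n.

    All callables are hypothetical placeholders for LLM-backed components.
    """
    completed: List[str] = []  # subtasks that passed verification, in order

    while len(completed) < n:
        # The proposer sees the persona and the subtasks finished so far.
        subtask = propose(persona, completed)

        for _ in range(max_revisions + 1):
            trajectory = execute(subtask)
            if verify(subtask, trajectory):
                completed.append(subtask)
                break
            # Assumption: on failure, the description is revised and retried.
            subtask = revise(subtask, trajectory)
        else:
            raise RuntimeError("subtask failed verification after all revisions")

    # Difficulty level k is formed by summarizing the first k subtasks.
    return [summarize(completed[:k]) for k in range(1, n + 1)]
```

With `n = 6`, the returned list would hold six tasks, where the task at index `k - 1` covers the first `k` subtasks. This mirrors how the caption ties task difficulty to the number of composed subtasks.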