Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao

TL;DR

Agent0 tackles the data bottleneck in LLM agents by enabling fully autonomous, zero-data co-evolution between a Curriculum Agent and an Executor Agent, enhanced with a code interpreter tool. The Curriculum Agent generates frontier tasks while the Executor Agent solves them, forming a self-reinforcing loop that raises task difficulty as capabilities improve. Key innovations include GRPO-based curriculum optimization, Ambiguity-Dynamic Policy Optimization (ADPO) to handle pseudo-label noise, and multi-turn tool-augmented reasoning that expands problem-solving horizons. Empirically, Agent0 improves Qwen3-8B-Base by 18% on mathematical reasoning and 24% on general reasoning benchmarks and demonstrates strong generalization, highlighting a scalable pathway for self-improving, tool-enabled agents without external data.

Abstract

Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model's inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor's problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming-lab/Agent0.

Paper Structure

This paper contains 24 sections, 10 equations, 5 figures, 20 tables, 1 algorithm.

Figures (5)

  • Figure 1: The $\texttt{Agent0}$ autonomous co-evolution framework. The Curriculum Agent (left) uses RL to generate frontier tasks, rewarded by the Executor Agent's uncertainty and tool-use frequency. The Executor Agent (right) learns to solve them via RL. This shared tool integration drives a virtuous cycle, spiraling up task complexity and agent capability entirely from scratch.
  • Figure 2: The $\texttt{Agent0}$ co-evolutionary loop. (1) Curriculum Evolution: The Curriculum Agent $\pi_{\theta}$ is trained via RL to generate tasks, maximizing a reward $R_C$ based on executor uncertainty $R_{\text{unc}}$, tool use $R_{\text{tool}}$, and a repetition penalty $R_{\text{rep}}$. (2) Executor Evolution: Tasks are filtered by self-consistency score $\hat{p}$ to create a challenging dataset $\mathcal{D}^{(t)}$. The Executor Agent $\pi_{\phi}$ is then trained on $\mathcal{D}^{(t)}$ via ADPO, an ambiguity-aware RL method using majority-vote pseudo-labels $\tilde{y}$.
  • Figure 3: Up-clipped token probabilities. Most up-clipped tokens have low probabilities, implying standard clipping limits exploration.
  • Figure 4: Performance on mathematical and general reasoning benchmarks, showing consistent improvement for both Qwen3-4B and Qwen3-8B across three co-evolutionary iterations.
  • Figure 5: Qualitative Case Analysis. Left: Examples of generated questions showing a clear increase in complexity and diversity from Iter 1 to Iter 3. Right: A demonstration of Agent0's solving process, utilizing a hybrid approach of mathematical reasoning and Python code execution to solve a standard MATH problem.
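
The filtering step described in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes that the self-consistency score $\hat{p}$ is the fraction of sampled executor answers that agree with the majority answer, and that the pseudo-label $\tilde{y}$ is that majority answer. The function names (majority_vote, filter_tasks) and the thresholds low and high are hypothetical and chosen only for illustration; the caption does not specify them, and this is not the authors' implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return (pseudo-label, self-consistency score) for one task.

    Assumption: the score is the share of sampled answers that match
    the most frequent answer.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def filter_tasks(
    samples: Dict[str, List[str]],
    low: float = 0.3,   # hypothetical lower bound, not from the paper
    high: float = 0.8,  # hypothetical upper bound, not from the paper
) -> List[Tuple[str, str, float]]:
    """Keep tasks whose score falls inside [low, high].

    Tasks the executor answers almost unanimously are treated as too easy,
    and tasks with almost no agreement as too noisy to pseudo-label.
    """
    kept = []
    for task, answers in samples.items():
        label, p_hat = majority_vote(answers)
        if low <= p_hat <= high:
            kept.append((task, label, p_hat))
    return kept


if __name__ == "__main__":
    sampled = {
        "easy task": ["7"] * 10,
        "frontier task": ["12", "12", "15", "9"],
    }
    for task, label, p_hat in filter_tasks(sampled):
        print(task, label, p_hat)
```

Under these assumptions, a task that every sample answers identically is dropped, while a task with partial agreement is kept together with its majority-vote label and score.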