Alita-G: Self-Evolving Generative Agent for Agent Generation

Jiahao Qiu, Xuan Qi, Hongru Wang, Xinzhe Juan, Yimin Wang, Zelin Zhao, Jiayi Geng, Jiacheng Guo, Peihang Li, Jingzhe Shi, Shilong Liu, Mengdi Wang

TL;DR

ALITA-G addresses the challenge of transforming generalist LLM agents into domain-specific experts through task-driven MCP generation, abstraction, and retrieval-augmented tool selection. By harvesting MCPs from successful trajectories and abstracting them into reusable primitives (the MCP Box), the framework enables end-to-end specialization with efficient inference via RAG. Empirical results on GAIA, PathVQA, and Humanity's Last Exam show state-of-the-art GAIA performance (pass@1 $=83.03\%$, pass@3 $=89.09\%$) and substantial compute reductions (about 15% fewer tokens) compared to strong baselines, with triple-generation MCP Boxes offering the best trade-off between coverage and redundancy. Collectively, ALITA-G demonstrates a principled pathway to scalable, domain-focused competence that generalizes across tasks while enhancing both accuracy and efficiency.

Abstract

Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Self-evolving agents have since emerged, but current work largely limits adaptation to prompt rewriting or failure retries. To address this gap, we present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted into parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection using each tool's description and use cases, then executes an agent equipped with the MCP Executor. Across several benchmarks (GAIA, PathVQA, and Humanity's Last Exam), ALITA-G attains strong gains while reducing computation costs. On GAIA validation, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. ALITA-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.
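
To make "abstracted to parameterized primitives" concrete, the sketch below contrasts a raw, task-bound tool with a parameterized version of it. This is a minimal illustration and not the paper's implementation: the function names, the regular expressions, and the sample text are all hypothetical, and registration of the tool under FastMCP is left out so the example stays self-contained.

```python
import re


# Raw MCP (hypothetical): written for one concrete task, so the quantity
# ("boiling point") and the unit ("K") are hard-coded into the pattern.
def raw_extract_boiling_point(text: str) -> float:
    match = re.search(r"boiling point[^0-9\-]*(-?\d+(?:\.\d+)?)\s*K", text)
    if match is None:
        raise ValueError("boiling point not found")
    return float(match.group(1))


# Abstracted MCP (hypothetical): the hard-coded values become parameters and
# the docstring records a description and a use case for later retrieval.
def extract_measurement(text: str, quantity: str, unit: str) -> float:
    """Return the first numeric value reported for `quantity` in `unit`.

    Use case: pulling a named physical property out of text extracted
    from a scientific PDF.
    """
    pattern = (
        re.escape(quantity)
        + r"[^0-9\-]*(-?\d+(?:\.\d+)?)\s*"
        + re.escape(unit)
        + r"(?!\w)"  # do not accept a longer unit such as "Kg" for "K"
    )
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"{quantity!r} in {unit!r} not found")
    return float(match.group(1))


# Assumed sample text, standing in for content extracted from a PDF.
sample = "The boiling point is 77.4 K and the density is 0.807 g/mL."
boiling_point = extract_measurement(sample, "boiling point", "K")
density = extract_measurement(sample, "density", "g/mL")
```

The raw function answers only the question it was written for, while the parameterized one covers both quantities in the sample with the same code path, which is the property that makes a tool worth storing in an MCP Box.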

Paper Structure

This paper contains 35 sections, 12 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overall workflow of Alita-G. The process begins with task-driven MCP generation, where a Master Agent repeatedly executes target tasks and distills a pool of raw MCPs from successful trajectories. These MCPs are then abstracted and refined through parameter generalization, context removal, interface standardization, and documentation enhancement to form a reusable MCP Box. At inference time, the MCP Box supports RAG-enhanced tool selection: user queries are matched against MCP descriptions, and threshold/top-$k$ filtering yields a contextually relevant set of MCPs. Finally, a specialized agent—comprising a Manager Agent with a Task Analyzer, MCP Retriever, and MCP Executor—runs a CodeAct loop to retrieve and invoke the selected MCPs, thereby transforming a general-purpose agent into a domain specialist for end-to-end task solving.
  • Figure 2: MCP generation and abstraction. Left: A raw MCP emerges during execution to extract measurements from scientific PDFs in response to a concrete task. Right: The MCP is abstracted, where hard-coded values are lifted into parameters, interfaces are standardized to FastMCP, and documentation is enhanced, yielding a reusable tool suitable for retrieval and reuse across tasks.
  • Figure 3: Effect of the MCP Box at inference. Baseline agent (no MCP Box): fails to obtain precise thermodynamic properties and answers incorrectly (20 mL). Specialized agent (with MCP Box): retrieves the abstracted extract_pdf_measurement via RAG, extracts the needed properties, and answers correctly (55 mL). The example underscores how abstraction plus MCP-level retrieval converts transient problem-solving into reusable competence that boosts downstream performance.
  • Figure : Specialized Agent Inference
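
The RAG-enhanced tool selection described in the Figure 1 caption (match the query against MCP descriptions, then apply threshold/top-$k$ filtering) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the bag-of-words `embed` function is a stand-in for whatever embedding model the system uses, the `MCPEntry` class, the default `threshold` and `top_k` values, and every box entry other than the name `extract_pdf_measurement` are hypothetical, and combining the threshold with a top-$k$ cutoff in one function is one possible reading of the caption.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class MCPEntry:
    name: str
    description: str
    use_case: str


def embed(text: str) -> Counter:
    # Assumption: lowercase bag-of-words counts replace a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_mcps(query, box, threshold=0.1, top_k=2):
    """Keep MCPs scoring at least `threshold`, then return the best `top_k` names."""
    q = embed(query)
    scored = [
        (cosine(q, embed(f"{m.description} {m.use_case}")), m.name) for m in box
    ]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept[:top_k]]


# Hypothetical MCP Box contents; only the first tool name appears in the source.
box = [
    MCPEntry(
        "extract_pdf_measurement",
        "Extract a numeric measurement from a scientific PDF",
        "Read thermodynamic properties from a paper",
    ),
    MCPEntry(
        "image_region_classifier",
        "Classify a region of a pathology image",
        "Answer questions about tissue slides",
    ),
    MCPEntry(
        "unit_converter",
        "Convert a value between measurement units",
        "Convert mL to L",
    ),
]

query = "Extract the thermodynamic properties reported in this PDF"
selected = select_mcps(query, box)
```

With these toy entries only `extract_pdf_measurement` shares vocabulary with the query, so it is the single tool that clears the threshold. Scoring against the description and use case together mirrors the abstract's statement that selection relies on both fields.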