Alita-G: Self-Evolving Generative Agent for Agent Generation
Jiahao Qiu, Xuan Qi, Hongru Wang, Xinzhe Juan, Yimin Wang, Zelin Zhao, Jiayi Geng, Jiacheng Guo, Peihang Li, Jingzhe Shi, Shilong Liu, Mengdi Wang
TL;DR
ALITA-G turns a generalist LLM agent into a domain-specific expert through task-driven MCP generation, abstraction, and retrieval-augmented tool selection. It harvests MCPs from successful trajectories and abstracts them into reusable primitives (the MCP Box), which enables end-to-end specialization with efficient RAG-based tool selection at inference time. On GAIA, PathVQA, and Humanity's Last Exam, it reports state-of-the-art GAIA performance (pass@1 = 83.03%, pass@3 = 89.09%) while using about 15% fewer tokens than a strong baseline agent, with triple-generation MCP Boxes offering the best trade-off between coverage and redundancy. Together, these results present ALITA-G as a principled pathway to scalable, domain-focused competence that improves both accuracy and efficiency.
Abstract
Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Self-evolving agents have since emerged, but current work largely limits adaptation to prompt rewriting or failure retries. To address this, we present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These candidates are then abstracted into parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection using each tool's description and use cases, then executes an agent equipped with the MCP Executor. Across several benchmarks (GAIA, PathVQA, and Humanity's Last Exam), ALITA-G attains strong gains while reducing computation costs. On GAIA validation, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. ALITA-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.
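The retrieval-augmented MCP selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `MCPEntry` record, the `select_mcps` function, the `top_k` and `min_score` parameters, and the example tools are all hypothetical, and a toy bag-of-words vector stands in for whatever text encoder the framework actually uses. The sketch only shows the general idea of scoring each tool's description and use case against the incoming task and keeping the best matches.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MCPEntry:
    """One abstracted tool in a hypothetical MCP Box (illustrative schema)."""
    name: str
    description: str
    use_case: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_mcps(task: str, box: List[MCPEntry], top_k: int = 2,
                min_score: float = 0.1) -> List[Tuple[str, float]]:
    """Rank tools by similarity to the task; keep at most top_k above min_score."""
    query = embed(task)
    scored = [(e.name, cosine(query, embed(e.description + " " + e.use_case)))
              for e in box]
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Hypothetical MCP Box contents, invented for this example.
EXAMPLE_BOX = [
    MCPEntry("pdf_table_extractor",
             "Extract tables from a PDF file into rows",
             "Read a table in a PDF report and return its columns"),
    MCPEntry("histology_image_describer",
             "Describe tissue structures visible in a pathology image",
             "Answer a question about a stained histology slide"),
    MCPEntry("unit_converter",
             "Convert a numeric value between measurement units",
             "Convert miles to kilometers"),
]

if __name__ == "__main__":
    print(select_mcps("Extract the revenue table from the PDF report", EXAMPLE_BOX))
```

Filtering by a similarity threshold before truncating to `top_k` means an unrelated task retrieves no tools at all, so the agent's context is not padded with irrelevant MCPs. That is one plausible reading of how retrieval could contribute to the reported token savings, not a claim about the paper's exact mechanism.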
