ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering
Zhuowen Yuan, Tao Liu, Yang Yang, Yang Wang, Feng Qi, Kaushik Rangadurai, Bo Li, Shuang Yang
TL;DR
The paper tackles the high cost of evaluating candidate ML architectures in LLM-driven AutoML by introducing ArchPilot, a three-agent NAS framework that decouples generation, evaluation, and orchestration. It combines a restart-enabled MCTS-based orchestration agent with a generation agent that produces runnable pipelines and an evaluation agent that uses a multi-proxy evaluation pipeline, including ridge-regularized weight fitting and a hard-zero policy, to guide search. Key contributions include modular agent separation, a principled multi-proxy scoring and weight-fitting mechanism, and a restart strategy to maintain alignment with updated evaluation signals. Experiments on MLE-Bench show ArchPilot outperforms SOTA baselines such as AIDE and ML-Master, achieving higher valid submission rates and better leaderboard rankings under constrained compute, demonstrating scalable, cost-efficient ML engineering with interpretable search dynamics.
Abstract
Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they rely heavily on repeated full training runs to evaluate candidate solutions, which leads to significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a novel Monte Carlo Tree Search (MCTS)-inspired algorithm with a restart mechanism and maintains a memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity-aware performance metric. This multi-agent collaboration allows ArchPilot to prioritize high-potential candidates with minimal reliance on expensive full training runs, enabling efficient ML engineering under limited budgets. Experiments on MLE-Bench demonstrate that ArchPilot outperforms state-of-the-art (SOTA) baselines such as AIDE and ML-Master, validating the effectiveness of our multi-agent system.
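To make the idea of aggregating proxy scores with ridge-regularized weight fitting more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each candidate has a small vector of proxy scores, that a few candidates also have a score from a full training run, and that the aggregate is a plain weighted sum. The function names (`solve_linear`, `fit_proxy_weights`, `aggregate`), the regularization parameter `lam`, and the toy numbers are all hypothetical. The hard-zero policy and the fidelity-aware details of the paper are not modeled here.

```python
from typing import List, Sequence


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] as mutable rows.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest absolute pivot for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_proxy_weights(
    proxy_scores: Sequence[Sequence[float]],
    full_scores: Sequence[float],
    lam: float = 1e-2,
) -> List[float]:
    """Ridge regression in closed form: w = (X^T X + lam * I)^{-1} X^T y.

    proxy_scores: one row of proxy values per fully evaluated candidate (X).
    full_scores:  the corresponding full-training scores (y).
    lam:          hypothetical ridge strength; larger values shrink the weights.
    """
    k = len(proxy_scores[0])
    gram = [
        [
            sum(row[i] * row[j] for row in proxy_scores) + (lam if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    rhs = [sum(row[i] * y for row, y in zip(proxy_scores, full_scores)) for i in range(k)]
    return solve_linear(gram, rhs)


def aggregate(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Combine one candidate's proxy scores into a single weighted score."""
    return sum(w * s for w, s in zip(weights, scores))


# Toy data (assumed): two proxies, three candidates with known full scores.
calibration_proxies = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
calibration_full = [0.6, 0.3, 0.9]
weights = fit_proxy_weights(calibration_proxies, calibration_full, lam=1e-2)
predicted = aggregate(weights, [0.5, 0.5])  # score for an unseen candidate
```

In a search loop of this kind, the weights would presumably be refit whenever new full-training results arrive, which is one plausible reason a restart mechanism is useful: scores computed under old weights may no longer agree with the updated evaluation signal.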
