ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

Zhuowen Yuan, Tao Liu, Yang Yang, Yang Wang, Feng Qi, Kaushik Rangadurai, Bo Li, Shuang Yang

TL;DR

The paper tackles the high cost of evaluating candidate ML architectures in LLM-driven AutoML by introducing ArchPilot, a three-agent NAS framework that decouples generation, evaluation, and orchestration. An orchestration agent runs a restart-enabled, MCTS-based search; a generation agent produces runnable pipelines; and an evaluation agent guides the search with a multi-proxy evaluation pipeline that includes ridge-regularized weight fitting and a hard-zero policy. Key contributions are the modular separation of agents, a principled multi-proxy scoring and weight-fitting mechanism, and a restart strategy that keeps the search aligned with updated evaluation signals. On MLE-Bench, ArchPilot outperforms state-of-the-art baselines such as AIDE and ML-Master, achieving higher valid submission rates and better leaderboard rankings under constrained compute. The authors present this as evidence of scalable, cost-efficient ML engineering with interpretable search dynamics.

Abstract

Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they rely heavily on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a novel Monte Carlo Tree Search (MCTS)-inspired algorithm with a restart mechanism and manages a memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity-aware performance metric. This multi-agent collaboration allows ArchPilot to prioritize high-potential candidates with minimal reliance on expensive full training runs, facilitating efficient ML engineering under limited budgets. Experiments on MLE-Bench demonstrate that ArchPilot outperforms state-of-the-art baselines such as AIDE and ML-Master, validating the effectiveness of our multi-agent system.
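
The proxy aggregation step can be illustrated with a small sketch. This is not the authors' implementation: it uses textbook closed-form ridge regression to fit one weight per proxy against a handful of true metrics from full training runs, then scores new candidates with the fitted weights. The function names, the `ridge_lambda` parameter, and the `run_failed` flag are hypothetical, and the reading of the "hard-zero policy" as "a failed run scores exactly zero" is an assumption, since this summary does not define it.

```python
import numpy as np


def fit_proxy_weights(proxy_matrix, true_scores, ridge_lambda=1.0):
    """Closed-form ridge regression: w = (P^T P + lambda * I)^{-1} P^T y.

    proxy_matrix: one row per fully trained candidate, one column per proxy.
    true_scores:  the true metric observed for each of those candidates.
    """
    P = np.asarray(proxy_matrix, dtype=float)
    y = np.asarray(true_scores, dtype=float)
    gram = P.T @ P + ridge_lambda * np.eye(P.shape[1])
    return np.linalg.solve(gram, P.T @ y)


def aggregate_score(proxy_vector, weights, run_failed=False):
    """Combine a candidate's proxy vector into a single scalar score."""
    # Assumed reading of the "hard-zero policy": a failed run scores exactly 0.
    if run_failed:
        return 0.0
    return float(np.dot(proxy_vector, weights))


# Toy data: four candidates, two proxies each (values are illustrative only).
proxies = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_metric = [0.5, 0.25, 0.75, 1.25]
weights = fit_proxy_weights(proxies, true_metric, ridge_lambda=0.1)
score = aggregate_score([1.5, 0.5], weights)
```

The ridge penalty shrinks the weights toward zero, which keeps the fit stable when only a few full training runs are available to calibrate against.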

Paper Structure

This paper contains 27 sections, 9 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of ArchPilot. The Orchestration Agent (OA) selects candidate nodes using MCTS, maintains memory, and coordinates the search process. The selected node, together with its context, is passed to the Generation Agent (GA), which drafts, debugs, or improves training scripts. The Evaluation Agent (EA) then executes proxy training or full training, producing proxy vectors, aggregated scores, and optional true metrics.
  • Figure 2: Performance vs. GPU budget across difficulty levels. Mean normalized ranking (lower is better) as a function of available GPU hours per task. ArchPilot achieves better (lower) ranking scores on low-, medium-, and high-difficulty tasks, as well as overall.
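
The select, generate, and evaluate loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration using the standard UCT selection rule, not the paper's MCTS-inspired algorithm: the restart mechanism and memory are omitted, and `Node`, `select`, `search`, the `branching` limit, and the `generate`/`evaluate` callables (stand-ins for the Generation and Evaluation Agents) are all hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    script: str
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0


def uct(child, parent_visits, c=1.4):
    """Standard UCT value: mean score plus an exploration bonus."""
    if child.visits == 0:
        return math.inf
    mean = child.total_score / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(root, branching=2):
    """Descend by UCT until reaching a node that can still be expanded."""
    node = root
    while len(node.children) >= branching:
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
    return node


def backpropagate(node, score):
    """Add the score to the node and all of its ancestors."""
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent


def search(root, generate, evaluate, steps, branching=2):
    """Orchestration loop: select a node, generate a child, score it."""
    for _ in range(steps):
        parent = select(root, branching)
        child = Node(script=generate(parent.script), parent=parent)
        parent.children.append(child)
        backpropagate(child, evaluate(child.script))
    return root


# Toy stand-ins for the Generation and Evaluation Agents (assumptions).
root = search(
    Node(script="base"),
    generate=lambda script: script + "+",
    evaluate=lambda script: min(len(script) / 10.0, 1.0),
    steps=5,
)
```

In a real system, `evaluate` would return the aggregated proxy score (or the true metric when a full run is available), and `generate` would call an LLM to draft, debug, or improve a training script.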