MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, George Karypis

TL;DR

MLZero introduces a hierarchical multi-agent framework that integrates a perception module with semantic and episodic memory alongside an iterative coding loop to achieve end-to-end multimodal AutoML with minimal human input. The approach effectively transforms raw multimodal data into predictions and executable code, formalized as $\mathcal{F}(x, U^{opt}) = (y, C, L)$, and demonstrates state-of-the-art results on MLE-bench Lite and the Multimodal AutoML Agent Benchmark. Ablation studies show that both semantic and episodic memory materially improve code quality, debugging efficiency, and success rates, while enabling robust performance even with an 8B parameter model. The results suggest strong practical potential for automated ML across diverse data modalities, though limitations related to library documentation and larger-model requirements motivate future work on smaller models and expanded tool libraries. Overall, MLZero advances end-to-end ML automation by coupling perception, memory, and iterative execution in a scalable, low-human-effort framework.

Abstract

Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context that effectively guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE-bench Lite, outperforming all competitors in both success rate and solution quality, securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach maintains its robust effectiveness even with a compact 8B LLM, outperforming full-size systems from existing solutions.
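
As a quick sanity check on the reported gain, the implied reference success rate can be backed out from the two numbers in the abstract. This assumes the +263.6% figure is a relative improvement in success rate over a single reference competitor; the abstract does not say which one, and the symbol $s_{\text{ref}}$ is introduced here only for illustration.

```latex
\frac{0.92 - s_{\text{ref}}}{s_{\text{ref}}} = 2.636
\quad\Longrightarrow\quad
s_{\text{ref}} = \frac{0.92}{1 + 2.636} \approx 0.253
```

Under that assumption, the reference system succeeds on roughly a quarter of the benchmark. On a 25-task benchmark, 0.92 would correspond to 23 solved tasks if each task is counted once; the abstract does not state how runs are aggregated.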

Paper Structure

This paper contains 78 sections, 8 equations, 4 figures, 17 tables, 1 algorithm.

Figures (4)

  • Figure 1: MLZero: An end-to-end multi-agent system that integrates specialized perception agents with dual memory modules (semantic and episodic) to power iterative coding cycles, transforming raw data into ready-to-use models and prediction outputs with zero human intervention.
  • Figure 2: MLZero: A multi-agent system for end-to-end multimodal ML automation with zero human interventions. During the initialization phase, the perception module selects the appropriate ML library and generates perceptual context to initialize semantic and episodic memory. In the subsequent generation phase, the system performs code generation, execution, and debugging iteratively with the assistance of perceptual context, semantic memory, and episodic memory until successful output is achieved. The surrounding panels detail the four key modules: (1) Perception (upper left) in Section \ref{method:perception}; (2) Semantic Memory (upper right) in Section \ref{method:semantic}; (3) Episodic Memory (lower right) in Section \ref{method:episodic}; and (4) Iterative Coding (lower left) in Section \ref{method:iterative}.
  • Figure 3: Comparing our agent with baselines on MLE-bench. Detailed results for each run and each agent are shown in Appendix \ref{app:mlebench_result_details}.
  • Figure 4: Ablation study for semantic memory: Impact on system performance and efficiency of different offline indexing settings (top) and retrieval size (0, 1, 3, 5, 10 documents) (bottom).
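
The two-phase workflow described in the Figure 2 caption (initialization by perception, then iterative generation, execution, and debugging with both memories) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `EpisodicMemory`, `retrieve`, `run_pipeline`, and the `perceive`/`generate`/`execute` callables are hypothetical names, the keyword-overlap retrieval is a toy stand-in for a real document index, and reading $L$ in $\mathcal{F}(x, U^{opt}) = (y, C, L)$ as a log of attempts is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EpisodicMemory:
    """Hypothetical store of failed attempts as (code, error message) pairs."""
    attempts: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, code: str, error: str) -> None:
        self.attempts.append((code, error))

    def summary(self) -> str:
        # One line per failed attempt; an empty string means no failures yet.
        return "\n".join(
            f"attempt {i}: {err}" for i, (_, err) in enumerate(self.attempts, 1)
        )


def retrieve(semantic_memory: Dict[str, str], query: str, k: int = 3) -> List[str]:
    """Toy retrieval: rank documents by word overlap with the query.

    Assumption: stands in for whatever index a real system would use.
    """
    words = set(query.lower().split())
    ranked = sorted(
        semantic_memory.values(),
        key=lambda doc: -len(words & set(doc.lower().split())),
    )
    return ranked[:k]


def run_pipeline(
    perceive: Callable[[], str],
    generate: Callable[[str, List[str], str], str],
    execute: Callable[[str], Tuple[bool, str]],
    semantic_memory: Dict[str, str],
    max_iters: int = 5,
) -> Tuple[Optional[str], Optional[str], List[Tuple[str, str]]]:
    """Return (output, code, attempt log), loosely mirroring (y, C, L)."""
    # Initialization phase: perception produces the context once.
    context = perceive()
    episodic = EpisodicMemory()

    # Generation phase: generate, execute, and debug until success.
    for _ in range(max_iters):
        docs = retrieve(semantic_memory, context)
        code = generate(context, docs, episodic.summary())
        ok, output = execute(code)
        if ok:
            return output, code, episodic.attempts
        # On failure, `output` carries the error; remember it for the next try.
        episodic.record(code, output)

    # Iteration budget exhausted without a successful run.
    return None, None, episodic.attempts
```

In this sketch, `generate` would wrap an LLM call and `execute` would run the candidate script in a sandbox; both are passed in as plain callables so the loop can be exercised with stubs. The retrieval size `k` corresponds to the kind of knob varied in the Figure 4 ablation, though the default of 3 here is arbitrary.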