MLZero: A Multi-Agent System for End-to-end Machine Learning Automation
Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, George Karypis
TL;DR
MLZero is a hierarchical multi-agent framework that combines a perception module, semantic and episodic memory, and an iterative coding loop to achieve end-to-end multimodal AutoML with minimal human input. The approach transforms raw multimodal data into predictions and executable code, formalized as $\mathcal{F}(x, U^{opt}) = (y, C, L)$, and reports state-of-the-art results on MLE-bench Lite and the Multimodal AutoML Agent Benchmark. Ablation studies show that semantic and episodic memory each improve code quality, debugging efficiency, and success rate, and that performance remains robust even with an 8B-parameter model. The results suggest practical potential for automated ML across diverse data modalities, though limitations related to library documentation and larger-model requirements motivate future work on smaller models and expanded tool libraries. Overall, MLZero advances end-to-end ML automation by coupling perception, memory, and iterative execution in a scalable framework that needs little human effort.
Abstract
Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module first transforms raw multimodal inputs into perceptual context that guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE-bench Lite, outperforming all competitors in both success rate and solution quality and securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin, with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach remains effective even with a compact 8B LLM, outperforming existing solutions that rely on full-size models.
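The abstract describes the workflow only at a high level. As a rough illustration, the Python sketch below shows one way a perception step, a knowledge-retrieval step (standing in for semantic memory), and an iterative generate-and-execute loop with a record of failed attempts (standing in for episodic memory) could fit together. Every name here (`EpisodicMemory`, `run_pipeline`, and the four callables) is hypothetical and is not MLZero's actual API; the stub functions replace what would be LLM agents and a real code executor.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class EpisodicMemory:
    """Hypothetical store of failed attempts as (code, error message) pairs."""
    attempts: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, code: str, error: str) -> None:
        self.attempts.append((code, error))

    def summary(self) -> str:
        # One line per failed attempt; an empty string means no history yet.
        return "\n".join(
            f"attempt {i}: {err}" for i, (_, err) in enumerate(self.attempts, 1)
        )


def run_pipeline(
    perceive: Callable[[str], str],            # raw data location -> perceptual context
    retrieve: Callable[[str], str],            # context -> library knowledge (semantic memory)
    generate: Callable[[str, str, str], str],  # (context, knowledge, history) -> code
    execute: Callable[[str], Tuple[bool, str]],  # code -> (succeeded, log output)
    data_path: str,
    max_iters: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Return (working code or None, execution logs), retrying on failure."""
    context = perceive(data_path)
    knowledge = retrieve(context)
    memory = EpisodicMemory()
    logs: List[str] = []
    for _ in range(max_iters):
        code = generate(context, knowledge, memory.summary())
        ok, output = execute(code)
        logs.append(output)
        if ok:
            return code, logs
        memory.record(code, output)  # feed the failure into the next attempt
    return None, logs


# Stub agents (assumptions for the demo): the generator emits broken code
# until it sees an error history, then emits a corrected version.
def fake_generate(context: str, knowledge: str, history: str) -> str:
    return "print('ok')" if history else "print(undefined_name)"


def fake_execute(code: str) -> Tuple[bool, str]:
    if "undefined_name" in code:
        return False, "NameError: undefined_name"
    return True, "ok"


code, logs = run_pipeline(
    perceive=lambda path: f"tabular data at {path}",
    retrieve=lambda ctx: "docs for a hypothetical ML library",
    generate=fake_generate,
    execute=fake_execute,
    data_path="data/train.csv",
)
```

In this toy run the first attempt fails, the error is recorded, and the second attempt succeeds, so `code` holds the corrected script and `logs` holds both execution outputs. The returned pair loosely mirrors the code and log outputs in $\mathcal{F}(x, U^{opt}) = (y, C, L)$, assuming $C$ and $L$ denote code and logs, which this excerpt does not spell out.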
